Life expectancy is the average number of additional years a person is expected to live.
Many factors influence the life expectancy of an individual, a population, or a sample.
We collected a data set from Kaggle. The data, attributed to the World Health Organization (WHO), supports statistical analysis of the factors influencing life expectancy. The life expectancy of a population, a sample, or an individual depends on several factors, such as demographic conditions, economic conditions, mortality rate, income composition, and many more. Research on health factors has found that immunization plays a significant role in life expectancy; the immunizations we consider from the Kaggle dataset are Hepatitis B, Polio, and Diphtheria. Our dataset is organized by country, so it gives insight into how life expectancy differs across countries with respect to the factors in our dataframe, in terms of low, high, and average values and other statistics. This information helps identify which countries need to focus more on improvement and growth and which are doing well for the well-being of their people.
Importing the Python libraries required for our final project analysis and experiments.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
import seaborn as sns
from scipy.stats import shapiro
import scipy.stats as stats
from numpy.random import randn
from scipy.stats import normaltest
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn import metrics
from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import DecisionTreeClassifier
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix
from sklearn.svm import SVC
from sklearn.model_selection import KFold
from statsmodels.stats import weightstats as stests
from sklearn import svm
from sklearn.neighbors import KNeighborsClassifier
from sklearn import model_selection
from sklearn import preprocessing
from sklearn.metrics import classification_report
import statsmodels.formula.api as smf
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
from seaborn import load_dataset, pairplot
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
from sklearn.linear_model import LogisticRegression
data = pd.read_csv("Life Expectancy Data.csv")
data_copy = data.copy()
data2 = data.copy()
data.head()
| | Country | Year | Status | Life expectancy | Adult Mortality | infant deaths | Alcohol | percentage expenditure | Hepatitis B | Measles | ... | Polio | Total expenditure | Diphtheria | HIV/AIDS | GDP | Population | thinness 1-19 years | thinness 5-9 years | Income composition of resources | Schooling |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 2015 | Developing | 65.0 | 263.0 | 62 | 0.01 | 71.279624 | 65.0 | 1154 | ... | 6.0 | 8.16 | 65.0 | 0.1 | 584.259210 | 33736494.0 | 17.2 | 17.3 | 0.479 | 10.1 |
| 1 | Afghanistan | 2014 | Developing | 59.9 | 271.0 | 64 | 0.01 | 73.523582 | 62.0 | 492 | ... | 58.0 | 8.18 | 62.0 | 0.1 | 612.696514 | 327582.0 | 17.5 | 17.5 | 0.476 | 10.0 |
| 2 | Afghanistan | 2013 | Developing | 59.9 | 268.0 | 66 | 0.01 | 73.219243 | 64.0 | 430 | ... | 62.0 | 8.13 | 64.0 | 0.1 | 631.744976 | 31731688.0 | 17.7 | 17.7 | 0.470 | 9.9 |
| 3 | Afghanistan | 2012 | Developing | 59.5 | 272.0 | 69 | 0.01 | 78.184215 | 67.0 | 2787 | ... | 67.0 | 8.52 | 67.0 | 0.1 | 669.959000 | 3696958.0 | 17.9 | 18.0 | 0.463 | 9.8 |
| 4 | Afghanistan | 2011 | Developing | 59.2 | 275.0 | 71 | 0.01 | 7.097109 | 68.0 | 3013 | ... | 68.0 | 7.87 | 68.0 | 0.1 | 63.537231 | 2978599.0 | 18.2 | 18.2 | 0.454 | 9.5 |
5 rows × 22 columns
data.head(): displays the first five (5) rows of our dataframe by default.
data.tail()
| | Country | Year | Status | Life expectancy | Adult Mortality | infant deaths | Alcohol | percentage expenditure | Hepatitis B | Measles | ... | Polio | Total expenditure | Diphtheria | HIV/AIDS | GDP | Population | thinness 1-19 years | thinness 5-9 years | Income composition of resources | Schooling |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2933 | Zimbabwe | 2004 | Developing | 44.3 | 723.0 | 27 | 4.36 | 0.0 | 68.0 | 31 | ... | 67.0 | 7.13 | 65.0 | 33.6 | 454.366654 | 12777511.0 | 9.4 | 9.4 | 0.407 | 9.2 |
| 2934 | Zimbabwe | 2003 | Developing | 44.5 | 715.0 | 26 | 4.06 | 0.0 | 7.0 | 998 | ... | 7.0 | 6.52 | 68.0 | 36.7 | 453.351155 | 12633897.0 | 9.8 | 9.9 | 0.418 | 9.5 |
| 2935 | Zimbabwe | 2002 | Developing | 44.8 | 73.0 | 25 | 4.43 | 0.0 | 73.0 | 304 | ... | 73.0 | 6.53 | 71.0 | 39.8 | 57.348340 | 125525.0 | 1.2 | 1.3 | 0.427 | 10.0 |
| 2936 | Zimbabwe | 2001 | Developing | 45.3 | 686.0 | 25 | 1.72 | 0.0 | 76.0 | 529 | ... | 76.0 | 6.16 | 75.0 | 42.1 | 548.587312 | 12366165.0 | 1.6 | 1.7 | 0.427 | 9.8 |
| 2937 | Zimbabwe | 2000 | Developing | 46.0 | 665.0 | 24 | 1.68 | 0.0 | 79.0 | 1483 | ... | 78.0 | 7.10 | 78.0 | 43.5 | 547.358878 | 12222251.0 | 11.0 | 11.2 | 0.434 | 9.8 |
5 rows × 22 columns
data.tail(): displays the last five (5) rows of our dataframe by default.
data.shape
(2938, 22)
'The above result shows that our dataframe has 2938 rows and 22 columns.'
data.columns
Index(['Country', 'Year', 'Status', 'Life expectancy ', 'Adult Mortality',
'infant deaths', 'Alcohol', 'percentage expenditure', 'Hepatitis B',
'Measles ', ' BMI ', 'under-five deaths ', 'Polio', 'Total expenditure',
'Diphtheria ', ' HIV/AIDS', 'GDP', 'Population',
' thinness 1-19 years', ' thinness 5-9 years',
'Income composition of resources', 'Schooling'],
dtype='object')
We have 22 columns in our dataset. The columns may or may not be related to one another, which we analyze further during our exploration using several Machine Learning / Data Science techniques.
data.columns = ['Country', 'Year', 'Status', 'Life_Expectancy', 'Adult_Mortality','Infant_deaths', 'Alcohol',
'Percentage_expenditure', 'Hepatitis_B','Measles','BMI', 'under-five-deaths', 'Polio',
'Total_expenditure','Diphtheria', 'HIV/AIDS', 'GDP', 'Population','thinness_1-19_years',
'thinness_5-9_years','Income_composition_of_resources', 'Schooling']
data_copy.columns = ['Country', 'Year', 'Status', 'Life_Expectancy', 'Adult_Mortality','Infant_deaths', 'Alcohol',
'Percentage_expenditure', 'Hepatitis_B','Measles','BMI', 'under-five-deaths', 'Polio',
'Total_expenditure','Diphtheria', 'HIV/AIDS', 'GDP', 'Population','thinness_1-19_years',
'thinness_5-9_years','Income_composition_of_resources', 'Schooling']
data2.columns = ['Country', 'Year', 'Status', 'Life_Expectancy', 'Adult_Mortality','Infant_deaths', 'Alcohol',
'Percentage_expenditure', 'Hepatitis_B','Measles','BMI', 'under-five-deaths', 'Polio',
'Total_expenditure','Diphtheria', 'HIV/AIDS', 'GDP', 'Population','thinness_1-19_years',
'thinness_5-9_years','Income_composition_of_resources', 'Schooling']
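The three hand-typed column lists above must stay identical, which is easy to get wrong. As a minimal sketch, the same cleanup can be done programmatically. The helper `clean_column_names` and the small `demo` frame are hypothetical and are not part of the original notebook. The result also differs slightly in capitalization from the hand-written names (for example `Life_expectancy` instead of `Life_Expectancy`).

```python
import pandas as pd

def clean_column_names(columns):
    """Strip surrounding whitespace and join words with underscores.

    Hypothetical helper for illustration only.
    """
    return [" ".join(str(c).split()).replace(" ", "_") for c in columns]

# Tiny stand-in frame with the same kind of messy headers as the raw CSV.
demo = pd.DataFrame(columns=["Life expectancy ", " BMI ", " HIV/AIDS", "Total expenditure"])
demo.columns = clean_column_names(demo.columns)
print(list(demo.columns))
```

Because copies made with `data.copy()` before renaming keep the old headers, the same helper would need to be applied to each copy, or the copies could be made after renaming.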
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2938 entries, 0 to 2937
Data columns (total 22 columns):
 #   Column                           Non-Null Count  Dtype
---  ------                           --------------  -----
 0   Country                          2938 non-null   object
 1   Year                             2938 non-null   int64
 2   Status                           2938 non-null   object
 3   Life_Expectancy                  2928 non-null   float64
 4   Adult_Mortality                  2928 non-null   float64
 5   Infant_deaths                    2938 non-null   int64
 6   Alcohol                          2744 non-null   float64
 7   Percentage_expenditure           2938 non-null   float64
 8   Hepatitis_B                      2385 non-null   float64
 9   Measles                          2938 non-null   int64
 10  BMI                              2904 non-null   float64
 11  under-five-deaths                2938 non-null   int64
 12  Polio                            2919 non-null   float64
 13  Total_expenditure                2712 non-null   float64
 14  Diphtheria                       2919 non-null   float64
 15  HIV/AIDS                         2938 non-null   float64
 16  GDP                              2490 non-null   float64
 17  Population                       2286 non-null   float64
 18  thinness_1-19_years              2904 non-null   float64
 19  thinness_5-9_years               2904 non-null   float64
 20  Income_composition_of_resources  2771 non-null   float64
 21  Schooling                        2775 non-null   float64
dtypes: float64(16), int64(4), object(2)
memory usage: 505.1+ KB
data.info(): reports the data type and non-null count of each column. For instance, our dataframe holds columns of type object, int64, and float64.
data.describe()
| | Year | Life_Expectancy | Adult_Mortality | Infant_deaths | Alcohol | Percentage_expenditure | Hepatitis_B | Measles | BMI | under-five-deaths | Polio | Total_expenditure | Diphtheria | HIV/AIDS | GDP | Population | thinness_1-19_years | thinness_5-9_years | Income_composition_of_resources | Schooling |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2938.000000 | 2928.000000 | 2928.000000 | 2938.000000 | 2744.000000 | 2938.000000 | 2385.000000 | 2938.000000 | 2904.000000 | 2938.000000 | 2919.000000 | 2712.00000 | 2919.000000 | 2938.000000 | 2490.000000 | 2.286000e+03 | 2904.000000 | 2904.000000 | 2771.000000 | 2775.000000 |
| mean | 2007.518720 | 69.224932 | 164.796448 | 30.303948 | 4.602861 | 738.251295 | 80.940461 | 2419.592240 | 38.321247 | 42.035739 | 82.550188 | 5.93819 | 82.324084 | 1.742103 | 7483.158469 | 1.275338e+07 | 4.839704 | 4.870317 | 0.627551 | 11.992793 |
| std | 4.613841 | 9.523867 | 124.292079 | 117.926501 | 4.052413 | 1987.914858 | 25.070016 | 11467.272489 | 20.044034 | 160.445548 | 23.428046 | 2.49832 | 23.716912 | 5.077785 | 14270.169342 | 6.101210e+07 | 4.420195 | 4.508882 | 0.210904 | 3.358920 |
| min | 2000.000000 | 36.300000 | 1.000000 | 0.000000 | 0.010000 | 0.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 3.000000 | 0.37000 | 2.000000 | 0.100000 | 1.681350 | 3.400000e+01 | 0.100000 | 0.100000 | 0.000000 | 0.000000 |
| 25% | 2004.000000 | 63.100000 | 74.000000 | 0.000000 | 0.877500 | 4.685343 | 77.000000 | 0.000000 | 19.300000 | 0.000000 | 78.000000 | 4.26000 | 78.000000 | 0.100000 | 463.935626 | 1.957932e+05 | 1.600000 | 1.500000 | 0.493000 | 10.100000 |
| 50% | 2008.000000 | 72.100000 | 144.000000 | 3.000000 | 3.755000 | 64.912906 | 92.000000 | 17.000000 | 43.500000 | 4.000000 | 93.000000 | 5.75500 | 93.000000 | 0.100000 | 1766.947595 | 1.386542e+06 | 3.300000 | 3.300000 | 0.677000 | 12.300000 |
| 75% | 2012.000000 | 75.700000 | 228.000000 | 22.000000 | 7.702500 | 441.534144 | 97.000000 | 360.250000 | 56.200000 | 28.000000 | 97.000000 | 7.49250 | 97.000000 | 0.800000 | 5910.806335 | 7.420359e+06 | 7.200000 | 7.200000 | 0.779000 | 14.300000 |
| max | 2015.000000 | 89.000000 | 723.000000 | 1800.000000 | 17.870000 | 19479.911610 | 99.000000 | 212183.000000 | 87.300000 | 2500.000000 | 99.000000 | 17.60000 | 99.000000 | 50.600000 | 119172.741800 | 1.293859e+09 | 27.700000 | 28.600000 | 0.948000 | 20.700000 |
data.describe(): provides summary statistics for all numeric columns: the count of non-null values, mean, standard deviation, minimum, 25th percentile, median (50%), 75th percentile, and maximum.
Dealing with missing values is an important step in any Machine Learning or Data Science workflow. Missing values are the null or NaN entries in our dataset; they can degrade the analysis techniques we apply and lead to ineffective or inaccurate results. Handling them is therefore part of data cleaning, so that we can work with clean data going forward.
data.isnull().sum()
Country                              0
Year                                 0
Status                               0
Life_Expectancy                     10
Adult_Mortality                     10
Infant_deaths                        0
Alcohol                            194
Percentage_expenditure               0
Hepatitis_B                        553
Measles                              0
BMI                                 34
under-five-deaths                    0
Polio                               19
Total_expenditure                  226
Diphtheria                          19
HIV/AIDS                             0
GDP                                448
Population                         652
thinness_1-19_years                 34
thinness_5-9_years                  34
Income_composition_of_resources    167
Schooling                          163
dtype: int64
The above command gives, for each column, the number of rows that hold null values. For instance, the Life_Expectancy column has 10 rows with null values, whereas the Country column has no missing values, meaning every row names a country.
data.isnull().sum()*100/len(data)
Country                             0.000000
Year                                0.000000
Status                              0.000000
Life_Expectancy                     0.340368
Adult_Mortality                     0.340368
Infant_deaths                       0.000000
Alcohol                             6.603131
Percentage_expenditure              0.000000
Hepatitis_B                        18.822328
Measles                             0.000000
BMI                                 1.157250
under-five-deaths                   0.000000
Polio                               0.646698
Total_expenditure                   7.692308
Diphtheria                          0.646698
HIV/AIDS                            0.000000
GDP                                15.248468
Population                         22.191967
thinness_1-19_years                 1.157250
thinness_5-9_years                  1.157250
Income_composition_of_resources     5.684139
Schooling                           5.547992
dtype: float64
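The count and the percentage of missing values are easier to compare side by side. The following is a minimal sketch that combines the two commands above into one table and keeps only the columns that actually have gaps. The function `missing_summary` and the `demo` frame are hypothetical names with made-up values, used only for illustration.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return per-column missing counts and percentages, largest first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts * 100 / len(df)).round(2),
    })
    # Keep only columns with at least one missing value.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Made-up stand-in data, not taken from the WHO dataset.
demo = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "GDP": [1.0, np.nan, np.nan, 4.0],
    "Alcohol": [0.5, 0.7, np.nan, 0.1],
})
print(missing_summary(demo))
```

Calling `missing_summary(data)` on the notebook's dataframe should list the same counts and percentages as above, assuming `data` is loaded as shown earlier.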
len(pd.unique(data['Country']))
193
pd.unique(data['Country'])
array(['Afghanistan', 'Albania', 'Algeria', 'Angola',
'Antigua and Barbuda', 'Argentina', 'Armenia', 'Australia',
'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh',
'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bhutan',
'Bolivia (Plurinational State of)', 'Bosnia and Herzegovina',
'Botswana', 'Brazil', 'Brunei Darussalam', 'Bulgaria',
'Burkina Faso', 'Burundi', "Côte d'Ivoire", 'Cabo Verde',
'Cambodia', 'Cameroon', 'Canada', 'Central African Republic',
'Chad', 'Chile', 'China', 'Colombia', 'Comoros', 'Congo',
'Cook Islands', 'Costa Rica', 'Croatia', 'Cuba', 'Cyprus',
'Czechia', "Democratic People's Republic of Korea",
'Democratic Republic of the Congo', 'Denmark', 'Djibouti',
'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt',
'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia',
'Ethiopia', 'Fiji', 'Finland', 'France', 'Gabon', 'Gambia',
'Georgia', 'Germany', 'Ghana', 'Greece', 'Grenada', 'Guatemala',
'Guinea', 'Guinea-Bissau', 'Guyana', 'Haiti', 'Honduras',
'Hungary', 'Iceland', 'India', 'Indonesia',
'Iran (Islamic Republic of)', 'Iraq', 'Ireland', 'Israel', 'Italy',
'Jamaica', 'Japan', 'Jordan', 'Kazakhstan', 'Kenya', 'Kiribati',
'Kuwait', 'Kyrgyzstan', "Lao People's Democratic Republic",
'Latvia', 'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Lithuania',
'Luxembourg', 'Madagascar', 'Malawi', 'Malaysia', 'Maldives',
'Mali', 'Malta', 'Marshall Islands', 'Mauritania', 'Mauritius',
'Mexico', 'Micronesia (Federated States of)', 'Monaco', 'Mongolia',
'Montenegro', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia',
'Nauru', 'Nepal', 'Netherlands', 'New Zealand', 'Nicaragua',
'Niger', 'Nigeria', 'Niue', 'Norway', 'Oman', 'Pakistan', 'Palau',
'Panama', 'Papua New Guinea', 'Paraguay', 'Peru', 'Philippines',
'Poland', 'Portugal', 'Qatar', 'Republic of Korea',
'Republic of Moldova', 'Romania', 'Russian Federation', 'Rwanda',
'Saint Kitts and Nevis', 'Saint Lucia',
'Saint Vincent and the Grenadines', 'Samoa', 'San Marino',
'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia',
'Seychelles', 'Sierra Leone', 'Singapore', 'Slovakia', 'Slovenia',
'Solomon Islands', 'Somalia', 'South Africa', 'South Sudan',
'Spain', 'Sri Lanka', 'Sudan', 'Suriname', 'Swaziland', 'Sweden',
'Switzerland', 'Syrian Arab Republic', 'Tajikistan', 'Thailand',
'The former Yugoslav republic of Macedonia', 'Timor-Leste', 'Togo',
'Tonga', 'Trinidad and Tobago', 'Tunisia', 'Turkey',
'Turkmenistan', 'Tuvalu', 'Uganda', 'Ukraine',
'United Arab Emirates',
'United Kingdom of Great Britain and Northern Ireland',
'United Republic of Tanzania', 'United States of America',
'Uruguay', 'Uzbekistan', 'Vanuatu',
'Venezuela (Bolivarian Republic of)', 'Viet Nam', 'Yemen',
'Zambia', 'Zimbabwe'], dtype=object)
len(pd.unique(data['Year']))
16
pd.unique(data['Year'])
array([2015, 2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, 2006, 2005,
2004, 2003, 2002, 2001, 2000], dtype=int64)
plt.figure(figsize=(12,8))
graph = sns.barplot(x="Year",y = "GDP", data=data, palette="prism")
plt.xticks(size=15)
plt.yticks(size=15)
plt.ylabel("GDP")
plt.xlabel("Year")
plt.title("Bar Plot: Relationship between Year and GDP")
plt.show()
Above plot: bar plot of GDP by year; seaborn's barplot draws the mean GDP for each year with an error bar.
data2002 = data.loc[data['Year'] == 2002]
data2002.describe()
| | Year | Life_Expectancy | Adult_Mortality | Infant_deaths | Alcohol | Percentage_expenditure | Hepatitis_B | Measles | BMI | under-five-deaths | Polio | Total_expenditure | Diphtheria | HIV/AIDS | GDP | Population | thinness_1-19_years | thinness_5-9_years | Income_composition_of_resources | Schooling |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 183.0 | 183.000000 | 183.000000 | 183.000000 | 182.000000 | 183.000000 | 113.000000 | 183.000000 | 181.000000 | 183.000000 | 181.000000 | 179.000000 | 181.000000 | 183.000000 | 155.000000 | 1.430000e+02 | 181.000000 | 181.000000 | 173.000000 | 173.000000 |
| mean | 2002.0 | 67.351366 | 171.437158 | 35.584699 | 4.660934 | 476.794487 | 76.522124 | 3204.754098 | 37.110497 | 50.300546 | 79.679558 | 5.687989 | 78.883978 | 2.573770 | 4599.303043 | 6.625328e+06 | 5.166298 | 5.118232 | 0.568006 | 11.140462 |
| std | 0.0 | 10.062469 | 133.472330 | 141.030361 | 3.967477 | 1162.688712 | 27.957269 | 8591.234477 | 18.449467 | 194.206932 | 25.188557 | 2.160620 | 24.627735 | 7.002383 | 8541.264026 | 1.578326e+07 | 4.738111 | 4.860138 | 0.246083 | 3.655926 |
| min | 2002.0 | 44.000000 | 3.000000 | 0.000000 | 0.010000 | 0.000000 | 4.000000 | 0.000000 | 1.000000 | 0.000000 | 4.000000 | 1.210000 | 4.000000 | 0.100000 | 1.681350 | 2.970000e+02 | 0.100000 | 0.100000 | 0.000000 | 0.000000 |
| 25% | 2002.0 | 59.350000 | 74.000000 | 0.000000 | 1.242500 | 5.161358 | 66.000000 | 0.000000 | 18.000000 | 0.500000 | 74.000000 | 4.230000 | 72.000000 | 0.100000 | 328.517666 | 1.414120e+05 | 1.600000 | 1.500000 | 0.444000 | 8.800000 |
| 50% | 2002.0 | 71.400000 | 146.000000 | 3.000000 | 3.745000 | 49.735114 | 88.000000 | 36.000000 | 42.000000 | 4.000000 | 91.000000 | 5.450000 | 88.000000 | 0.100000 | 899.555686 | 1.366667e+06 | 3.400000 | 3.300000 | 0.632000 | 11.800000 |
| 75% | 2002.0 | 74.800000 | 234.500000 | 25.000000 | 7.382500 | 242.749329 | 95.000000 | 1518.000000 | 54.000000 | 35.000000 | 97.000000 | 7.135000 | 96.000000 | 1.550000 | 3278.108372 | 6.627962e+06 | 7.700000 | 7.600000 | 0.745000 | 13.500000 |
| max | 2002.0 | 84.000000 | 699.000000 | 1700.000000 | 14.170000 | 6853.628494 | 99.000000 | 58341.000000 | 69.700000 | 2300.000000 | 99.000000 | 14.550000 | 99.000000 | 49.900000 | 41336.721920 | 1.446541e+08 | 27.400000 | 28.400000 | 0.916000 | 20.100000 |
As the bar plot above shows, 2002 had the lowest GDP when we looked at the relationship between GDP and year. Here we created a separate dataframe containing only the rows for 2002 from our original dataframe and used the describe command to obtain summary statistics for that year across countries and the other factors.
data2008 = data.loc[data['Year'] == 2008]
data2008.describe()
| | Year | Life_Expectancy | Adult_Mortality | Infant_deaths | Alcohol | Percentage_expenditure | Hepatitis_B | Measles | BMI | under-five-deaths | Polio | Total_expenditure | Diphtheria | HIV/AIDS | GDP | Population | thinness_1-19_years | thinness_5-9_years | Income_composition_of_resources | Schooling |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 183.0 | 183.000000 | 183.000000 | 183.000000 | 182.000000 | 183.000000 | 163.000000 | 183.000000 | 181.000000 | 183.000000 | 182.000000 | 180.000000 | 182.000000 | 183.000000 | 156.000000 | 1.430000e+02 | 181.000000 | 181.000000 | 173.000000 | 173.000000 |
| mean | 2008.0 | 69.427869 | 174.519126 | 29.568306 | 5.007088 | 1095.802669 | 83.644172 | 1523.229508 | 38.225414 | 41.322404 | 85.565934 | 5.723056 | 84.857143 | 1.797268 | 10604.040364 | 9.487742e+06 | 4.907182 | 4.941436 | 0.645717 | 12.176301 |
| std | 0.0 | 9.202612 | 120.419771 | 111.310646 | 4.098128 | 2615.641873 | 21.084686 | 10375.663062 | 20.183855 | 155.823759 | 20.138019 | 2.558672 | 20.721554 | 4.854741 | 19256.603682 | 2.470296e+07 | 4.341403 | 4.555363 | 0.188646 | 3.134349 |
| min | 2008.0 | 46.200000 | 1.000000 | 0.000000 | 0.010000 | 0.000000 | 9.000000 | 0.000000 | 2.300000 | 0.000000 | 3.000000 | 0.740000 | 5.000000 | 0.100000 | 5.668726 | 4.300000e+01 | 0.100000 | 0.100000 | 0.000000 | 0.000000 |
| 25% | 2008.0 | 62.700000 | 87.500000 | 0.000000 | 1.375000 | 16.904198 | 81.000000 | 0.000000 | 19.600000 | 0.000000 | 83.000000 | 3.927500 | 81.250000 | 0.100000 | 471.429499 | 2.144335e+05 | 1.700000 | 1.700000 | 0.511000 | 10.300000 |
| 50% | 2008.0 | 72.400000 | 157.000000 | 3.000000 | 4.185000 | 93.367890 | 92.000000 | 4.000000 | 43.400000 | 4.000000 | 93.000000 | 5.695000 | 93.000000 | 0.100000 | 1788.274085 | 1.946351e+06 | 3.400000 | 3.400000 | 0.697000 | 12.400000 |
| 75% | 2008.0 | 75.350000 | 234.500000 | 22.000000 | 8.292500 | 529.524032 | 97.000000 | 134.500000 | 56.600000 | 28.000000 | 97.000000 | 7.332500 | 97.000000 | 0.850000 | 8513.580869 | 8.266880e+06 | 7.400000 | 7.300000 | 0.779000 | 14.300000 |
| max | 2008.0 | 89.000000 | 632.000000 | 1300.000000 | 16.990000 | 18961.348600 | 99.000000 | 131441.000000 | 74.100000 | 1800.000000 | 99.000000 | 16.200000 | 99.000000 | 40.200000 | 114293.843300 | 2.361593e+08 | 27.000000 | 27.900000 | 0.936000 | 19.500000 |
As the bar plot above shows, 2008 had the highest GDP when we looked at the relationship between GDP and year. Here we created a separate dataframe containing only the rows for 2008 from our original dataframe and used the describe command to obtain summary statistics for that year across countries and the other factors.
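Reading the lowest and highest years off a bar plot can be confirmed numerically. As a minimal sketch, grouping by year and taking the mean GDP gives the same quantity the bar heights represent, and `idxmin` / `idxmax` return the corresponding years. The `demo` frame below holds made-up numbers purely for illustration; they are not values from the dataset.

```python
import pandas as pd

# Made-up stand-in data: two rows per year.
demo = pd.DataFrame({
    "Year": [2002, 2002, 2008, 2008, 2010, 2010],
    "GDP": [100.0, 300.0, 900.0, 1100.0, 400.0, 600.0],
})

# Mean GDP per year (missing GDP values are skipped by default).
mean_gdp = demo.groupby("Year")["GDP"].mean()

lowest_year = mean_gdp.idxmin()    # year with the smallest mean GDP
highest_year = mean_gdp.idxmax()   # year with the largest mean GDP
print(mean_gdp)
print("lowest:", lowest_year, "highest:", highest_year)
```

The same two lines applied to `data` (assuming the renamed `Year` and `GDP` columns) would check the years chosen for the describe tables above.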
plt.figure(figsize=(12,8))
graph = sns.barplot(x="Year",y = "Life_Expectancy", data=data, palette="plasma")
plt.xticks(size=15)
plt.yticks(size=15)
plt.ylabel("Life_Expectancy")
plt.xlabel("Year")
plt.title("Bar Plot: Relationship between Year and Life Expectancy")
plt.show()
Above plot: bar plot of Life_Expectancy by year; each bar shows the mean life expectancy for that year.
Life_Expectancy is our dependent (target) variable and the main focus of our analysis and data exploration. After examining the dataset and the relationships among its columns, I treat life expectancy as the key column and try to understand and observe how it relates to the other columns of our original dataframe. Alongside this analysis, I identify outliers and then remove them as part of data curation (data cleaning).
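The text above does not say which rule is used to flag outliers. One common choice, and the one that matches the default whiskers of a box plot, is the 1.5 × IQR rule. The following is a minimal sketch under that assumption; the function `iqr_bounds` and the sample values are hypothetical and not taken from the dataset.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up life-expectancy-like values with one obviously low entry.
values = pd.Series([60.0, 62.0, 65.0, 70.0, 72.0, 75.0, 20.0])
low, high = iqr_bounds(values)

# Values outside the fences are treated as outliers under this rule.
outliers = values[(values < low) | (values > high)]
print(low, high)
print(outliers.tolist())
```

Whether flagged rows should be dropped, capped, or kept is a separate modelling decision; for country-level health data, extreme values can be genuine rather than errors.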
data['Life_Expectancy'].hist()
<AxesSubplot: >
plt.figure(figsize=(10,10))
plt.bar(data.groupby('Status')['Status'].count().index, data.groupby('Status')['Life_Expectancy'].mean(), color="red")
plt.xlabel("Status",fontsize=12)
plt.ylabel("Avg Life_Expectancy",fontsize=12)
plt.title("Life_Expectancy with respect to Status")
plt.show()
Above plot: bar plot of the mean Life_Expectancy for each Status group.
# Life_Expectancy with respect to Year using bar plot.
plt.figure(figsize=(10,10))
plt.bar(data.groupby('Year')['Year'].count().index, data.groupby('Year')['Life_Expectancy'].mean(),color='green',alpha=0.65)
plt.xlabel("Year",fontsize=12)
plt.ylabel("Avg Life_Expectancy",fontsize=12)
plt.title("Life_Expectancy with respect to Year")
plt.show()
plt.figure(figsize=(8, 8))
plt.scatter(data["Schooling"], data["Income_composition_of_resources"], color="orange")
plt.title("Income_composition_of_resources and Schooling")
plt.show()
Above plot: scatter plot of Schooling (x-axis) against Income_composition_of_resources (y-axis).
*Observations:*
plt.figure(figsize=(8, 8))
plt.scatter(data["BMI"], data["Life_Expectancy"], color="brown")
plt.title("BMI and Life_Expectancy")
plt.show()
Above plot: scatter plot of BMI (x-axis) against Life_Expectancy (y-axis).
data.groupby('Life_Expectancy').count().plot(kind='bar')
<AxesSubplot: xlabel='Life_Expectancy'>
# Use the actual numeric columns of the dataset for the box plot, not randomly generated values.
# Note that the columns are on very different scales, so some boxes will look flat.
data1 = data.select_dtypes(include="number")
fig, ax = plt.subplots(figsize=(55,20))
sns.boxplot(x="variable", y="value", data=pd.melt(data1)).set(title='Box Plot for all the columns of our dataset')
[Text(0.5, 1.0, 'Box Plot for all the columns of our dataset')]
Above Plot:
data['Life_Expectancy'].plot(kind='hist', bins=30)
<AxesSubplot: ylabel='Frequency'>
data['Life_Expectancy'].plot(kind='box', title='BOX PLOT: Life Expectancies before removing outliers',figsize=(7,4))
plt.show()
Above Plot:
data['Adult_Mortality'].plot(kind='box', title='BOX PLOT: Adult_Mortality before removing outliers',figsize=(7,4))
plt.show()
Above Plot:
data['Infant_deaths'].plot(kind='box', title='BOX PLOT: Infant_deaths before removing outliers',figsize=(7,4))
plt.show()
Above Plot:
data['Alcohol'].plot(kind='box', title='BOX PLOT: Alcohol before removing outliers',figsize=(7,4))
plt.show()
Above Plot:
data['Percentage_expenditure'].plot(kind='box', title='BOX PLOT: Percentage_expenditure before removing outliers',figsize=(7,4))
plt.show()
Above Plot:
data['Hepatitis_B'].plot(kind='box', title='BOX PLOT: Hepatitis_B before removing outliers',figsize=(7,4))
plt.show()
Above Plot:
data['Measles'].plot(kind='box', title='BOX PLOT: Measles before removing outliers',figsize=(7,4))
plt.show()
Above Plot:
data['BMI'].plot(kind='box', title='BOX PLOT: BMI before removing outliers',figsize=(7,4))
plt.show()
Above Plot:
data['under-five-deaths'].plot(kind='box', title='BOX PLOT: under-five-deaths before removing outliers',figsize=(7,4))
plt.show()
Above Plot:
data['Polio'].plot(kind='box', title='BOX PLOT: Polio before removing outliers',figsize=(7,4))
plt.show()
Above Plot:
data['Total_expenditure'].plot(kind='box', title='BOX PLOT: Total_expenditure before removing outliers',figsize=(7,4))
plt.show()
Above Plot:
data['HIV/AIDS'].plot(kind='box', title='BOX PLOT: HIV/AIDS before removing outliers',figsize=(7,4))
plt.show()
Above Plot:
data['Diphtheria'].plot(kind='box', title='BOX PLOT: Diphtheria before removing outliers',figsize=(7,4))
plt.show()
Above Plot:
data['thinness_1-19_years'].plot(kind='box', title='BOX PLOT: thinness_1-19_years before removing outliers',figsize=(7,4))
plt.show()
Above Plot:
data['thinness_5-9_years'].plot(kind='box', title='BOX PLOT: thinness_5-9_years before removing outliers',figsize=(7,4))
plt.show()
Above Plot:
data['GDP'].plot(kind='box', title='BOX PLOT: GDP before removing outliers',figsize=(7,4))
plt.show()
Above Plot:
countries = data.Country.unique()
columns = ['Year', 'Status', 'Life_Expectancy', 'Adult_Mortality','Infant_deaths', 'Alcohol',
'Percentage_expenditure', 'Hepatitis_B','Measles','BMI', 'under-five-deaths', 'Polio',
'Total_expenditure','Diphtheria', 'HIV/AIDS', 'GDP', 'Population','thinness_1-19_years',
'thinness_5-9_years','Income_composition_of_resources', 'Schooling']
# Fill gaps by linear interpolation separately within each country's rows.
for c in countries:
    data.loc[data['Country'] == c,columns] = data.loc[data['Country'] == c,columns].interpolate()
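The per-country loop above can also be written with a group-wise transform. This is a minimal sketch, not the notebook's own method; the `demo` frame and the `numeric_cols` list are hypothetical stand-ins with made-up values. It also illustrates one property of `interpolate()` with default settings: a missing value at the start of a group has no earlier value to interpolate from, so it stays NaN.

```python
import numpy as np
import pandas as pd

# Made-up stand-in data, sorted by descending year within each country.
demo = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B", "B"],
    "Year": [2015, 2014, 2013, 2015, 2014, 2013],
    "GDP": [300.0, np.nan, 100.0, np.nan, 50.0, 70.0],
})

numeric_cols = ["GDP"]  # columns to interpolate (assumed numeric)

# Interpolate each column within its own country, never across countries.
demo[numeric_cols] = (
    demo.groupby("Country")[numeric_cols]
        .transform(lambda s: s.interpolate())
)
print(demo)
```

In this sketch the gap inside country A is filled with the midpoint of its neighbours, while the leading gap in country B remains NaN. That behaviour may be one reason some rows still contain NaN after interpolation, particularly rows for the first listed year of a country.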
data[np.isnan(data['Life_Expectancy'])]
| | Country | Year | Status | Life_Expectancy | Adult_Mortality | Infant_deaths | Alcohol | Percentage_expenditure | Hepatitis_B | Measles | ... | Polio | Total_expenditure | Diphtheria | HIV/AIDS | GDP | Population | thinness_1-19_years | thinness_5-9_years | Income_composition_of_resources | Schooling |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 624 | Cook Islands | 2013 | Developing | NaN | NaN | 0 | 0.01 | 0.000000 | 98.0 | 0 | ... | 98.0 | 3.58 | 98.0 | 0.1 | NaN | NaN | 0.1 | 0.1 | NaN | NaN |
| 769 | Dominica | 2013 | Developing | NaN | NaN | 0 | 0.01 | 11.419555 | 96.0 | 0 | ... | 96.0 | 5.58 | 96.0 | 0.1 | 722.756650 | NaN | 2.7 | 2.6 | 0.721 | 12.7 |
| 1650 | Marshall Islands | 2013 | Developing | NaN | NaN | 0 | 0.01 | 871.878317 | 8.0 | 0 | ... | 79.0 | 17.24 | 79.0 | 0.1 | 3617.752354 | NaN | 0.1 | 0.1 | NaN | 0.0 |
| 1715 | Monaco | 2013 | Developing | NaN | NaN | 0 | 0.01 | 0.000000 | 99.0 | 0 | ... | 99.0 | 4.30 | 99.0 | 0.1 | NaN | NaN | NaN | NaN | NaN | NaN |
| 1812 | Nauru | 2013 | Developing | NaN | NaN | 0 | 0.01 | 15.606596 | 87.0 | 0 | ... | 87.0 | 4.65 | 87.0 | 0.1 | 136.183210 | NaN | 0.1 | 0.1 | NaN | 9.6 |
| 1909 | Niue | 2013 | Developing | NaN | NaN | 0 | 0.01 | 0.000000 | 99.0 | 0 | ... | 99.0 | 7.20 | 99.0 | 0.1 | NaN | NaN | 0.1 | 0.1 | NaN | NaN |
| 1958 | Palau | 2013 | Developing | NaN | NaN | 0 | NaN | 344.690631 | 99.0 | 0 | ... | 99.0 | 9.27 | 99.0 | 0.1 | 1932.122370 | 292.0 | 0.1 | 0.1 | 0.779 | 14.2 |
| 2167 | Saint Kitts and Nevis | 2013 | Developing | NaN | NaN | 0 | 8.54 | 0.000000 | 97.0 | 0 | ... | 96.0 | 6.14 | 96.0 | 0.1 | NaN | NaN | 3.7 | 3.6 | 0.749 | 13.4 |
| 2216 | San Marino | 2013 | Developing | NaN | NaN | 0 | 0.01 | 0.000000 | 69.0 | 0 | ... | 69.0 | 6.50 | 69.0 | 0.1 | NaN | NaN | NaN | NaN | NaN | 15.1 |
| 2713 | Tuvalu | 2013 | Developing | NaN | NaN | 0 | 0.01 | 78.281203 | 9.0 | 0 | ... | 9.0 | 16.61 | 9.0 | 0.1 | 3542.135890 | 1819.0 | 0.2 | 0.1 | NaN | 0.0 |
10 rows × 22 columns
data = data.drop(data.index[[624, 769, 1650,1715,1812,1909,1958,2167,2216,2713]])
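Dropping rows by hard-coded positions works only while the row order and index stay exactly as they were when the positions were read off. As a minimal sketch of a less brittle alternative, `dropna` with the `subset` argument removes every row whose target column is missing, regardless of position. The `demo` frame below is hypothetical, with made-up values.

```python
import numpy as np
import pandas as pd

# Made-up stand-in data: one row lacks the target, another lacks GDP.
demo = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Life_Expectancy": [70.1, np.nan, 65.3],
    "GDP": [np.nan, 10.0, 20.0],
})

# Drop only the rows where Life_Expectancy is missing;
# missing values in other columns are left untouched.
cleaned = demo.dropna(subset=["Life_Expectancy"])
print(cleaned)
```

Applied to the notebook's dataframe, `data.dropna(subset=['Life_Expectancy'])` should remove the same ten rows listed above, assuming no other rows have a missing Life_Expectancy at that point.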
data[np.isnan(data['Adult_Mortality'])]
| Country | Year | Status | Life_Expectancy | Adult_Mortality | Infant_deaths | Alcohol | Percentage_expenditure | Hepatitis_B | Measles | ... | Polio | Total_expenditure | Diphtheria | HIV/AIDS | GDP | Population | thinness_1-19_years | thinness_5-9_years | Income_composition_of_resources | Schooling |
|---|
0 rows × 22 columns
data[np.isnan(data['Infant_deaths'])]
| Country | Year | Status | Life_Expectancy | Adult_Mortality | Infant_deaths | Alcohol | Percentage_expenditure | Hepatitis_B | Measles | ... | Polio | Total_expenditure | Diphtheria | HIV/AIDS | GDP | Population | thinness_1-19_years | thinness_5-9_years | Income_composition_of_resources | Schooling |
|---|
0 rows × 22 columns
data[np.isnan(data.Alcohol)]
| | Country | Year | Status | Life_Expectancy | Adult_Mortality | Infant_deaths | Alcohol | Percentage_expenditure | Hepatitis_B | Measles | ... | Polio | Total_expenditure | Diphtheria | HIV/AIDS | GDP | Population | thinness_1-19_years | thinness_5-9_years | Income_composition_of_resources | Schooling |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 32 | Algeria | 2015 | Developing | 75.6 | 19.0 | 21 | NaN | 0.0 | 95.0 | 63 | ... | 95.0 | NaN | 95.0 | 0.1 | 4132.762920 | 39871528.0 | 6.0 | 5.8 | 0.743 | 14.4 |
| 48 | Angola | 2015 | Developing | 52.4 | 335.0 | 66 | NaN | 0.0 | 64.0 | 118 | ... | 7.0 | NaN | 64.0 | 1.9 | 3695.793748 | 2785935.0 | 8.3 | 8.2 | 0.531 | 11.4 |
| 64 | Antigua and Barbuda | 2015 | Developing | 76.4 | 13.0 | 0 | NaN | 0.0 | 99.0 | 0 | ... | 86.0 | NaN | 99.0 | 0.2 | 13566.954100 | NaN | 3.3 | 3.3 | 0.784 | 13.9 |
| 80 | Argentina | 2015 | Developing | 76.3 | 116.0 | 8 | NaN | 0.0 | 94.0 | 0 | ... | 93.0 | NaN | 94.0 | 0.1 | 13467.123600 | 43417765.0 | 1.0 | 0.9 | 0.826 | 17.3 |
| 96 | Armenia | 2015 | Developing | 74.8 | 118.0 | 1 | NaN | 0.0 | 94.0 | 33 | ... | 96.0 | NaN | 94.0 | 0.1 | 369.654776 | 291695.0 | 2.1 | 2.2 | 0.741 | 12.7 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2858 | Venezuela (Bolivarian Republic of) | 2015 | Developing | 74.1 | 157.0 | 9 | NaN | 0.0 | 87.0 | 0 | ... | 87.0 | NaN | 87.0 | 0.1 | NaN | NaN | 1.6 | 1.5 | 0.769 | 14.3 |
| 2874 | Viet Nam | 2015 | Developing | 76.0 | 127.0 | 28 | NaN | 0.0 | 97.0 | 256 | ... | 97.0 | NaN | 97.0 | 0.1 | NaN | NaN | 14.2 | 14.5 | 0.678 | 12.6 |
| 2890 | Yemen | 2015 | Developing | 65.7 | 224.0 | 37 | NaN | 0.0 | 69.0 | 468 | ... | 63.0 | NaN | 69.0 | 0.1 | NaN | NaN | 13.6 | 13.4 | 0.499 | 9.0 |
| 2906 | Zambia | 2015 | Developing | 61.8 | 33.0 | 27 | NaN | 0.0 | 9.0 | 9 | ... | 9.0 | NaN | 9.0 | 4.1 | 1313.889646 | 161587.0 | 6.3 | 6.1 | 0.576 | 12.5 |
| 2922 | Zimbabwe | 2015 | Developing | 67.0 | 336.0 | 22 | NaN | 0.0 | 87.0 | 0 | ... | 88.0 | NaN | 87.0 | 6.2 | 118.693830 | 15777451.0 | 5.6 | 5.5 | 0.507 | 10.3 |
192 rows × 22 columns
data['Alcohol'] = data['Alcohol'].replace('-', np.nan)
data = data.dropna(axis=0, subset=['Alcohol'])
data[np.isnan(data['Percentage_expenditure'])]
| Country | Year | Status | Life_Expectancy | Adult_Mortality | Infant_deaths | Alcohol | Percentage_expenditure | Hepatitis_B | Measles | ... | Polio | Total_expenditure | Diphtheria | HIV/AIDS | GDP | Population | thinness_1-19_years | thinness_5-9_years | Income_composition_of_resources | Schooling |
|---|
0 rows × 22 columns
data[np.isnan(data['Hepatitis_B'])]
| | Country | Year | Status | Life_Expectancy | Adult_Mortality | Infant_deaths | Alcohol | Percentage_expenditure | Hepatitis_B | Measles | ... | Polio | Total_expenditure | Diphtheria | HIV/AIDS | GDP | Population | thinness_1-19_years | thinness_5-9_years | Income_composition_of_resources | Schooling |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 738 | Denmark | 2014 | Developed | 84.0 | 73.0 | 0 | 9.64 | 10468.762920 | NaN | 27 | ... | 94.0 | 1.80 | 94.0 | 0.1 | 62425.53920 | 5643475.0 | 1.1 | 0.9 | 0.926 | 19.2 |
| 739 | Denmark | 2013 | Developed | 81.0 | 75.0 | 0 | 9.50 | 10261.763000 | NaN | 17 | ... | 94.0 | 11.25 | 94.0 | 0.1 | 61191.19263 | 5614932.0 | 1.1 | 0.9 | 0.924 | 18.7 |
| 740 | Denmark | 2012 | Developed | 80.0 | 76.0 | 0 | 9.26 | 928.417079 | NaN | 2 | ... | 94.0 | 1.98 | 94.0 | 0.1 | 5857.52100 | 5591572.0 | 1.1 | 0.9 | 0.922 | 18.4 |
| 741 | Denmark | 2011 | Developed | 79.7 | 79.0 | 0 | 10.47 | 10251.108720 | NaN | 84 | ... | 91.0 | 1.87 | 91.0 | 0.1 | 61753.66700 | 557572.0 | 1.1 | 0.9 | 0.910 | 16.9 |
| 742 | Denmark | 2010 | Developed | 79.2 | 84.0 | 0 | 10.28 | 954.486593 | NaN | 5 | ... | 9.0 | 11.80 | 9.0 | 0.1 | 5841.41122 | 5547683.0 | 1.1 | 0.9 | 0.906 | 16.8 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2773 | United Kingdom of Great Britain and Northern I... | 2004 | Developed | 78.8 | 83.0 | 4 | 12.22 | 0.000000 | NaN | 189 | ... | 92.0 | 7.98 | 92.0 | 0.1 | NaN | NaN | 0.7 | 0.5 | NaN | NaN |
| 2774 | United Kingdom of Great Britain and Northern I... | 2003 | Developed | 78.3 | 86.0 | 4 | 11.85 | 0.000000 | NaN | 460 | ... | 91.0 | 7.81 | 91.0 | 0.1 | NaN | NaN | 0.7 | 0.5 | NaN | NaN |
| 2775 | United Kingdom of Great Britain and Northern I... | 2002 | Developed | 78.2 | 87.0 | 4 | 11.44 | 0.000000 | NaN | 314 | ... | 91.0 | 7.57 | 91.0 | 0.1 | NaN | NaN | 0.7 | 0.5 | NaN | NaN |
| 2776 | United Kingdom of Great Britain and Northern I... | 2001 | Developed | 78.0 | 88.0 | 4 | 10.91 | 0.000000 | NaN | 73 | ... | 91.0 | 7.31 | 91.0 | 0.1 | NaN | NaN | 0.7 | 0.5 | NaN | NaN |
| 2777 | United Kingdom of Great Britain and Northern I... | 2000 | Developed | 77.8 | 89.0 | 4 | 10.59 | 0.000000 | NaN | 104 | ... | 91.0 | 6.94 | 91.0 | 0.1 | NaN | NaN | 0.7 | 0.5 | NaN | NaN |
137 rows × 22 columns
data['Hepatitis_B'] = data['Hepatitis_B'].replace('-', np.nan)
data = data.dropna(axis=0, subset=['Hepatitis_B'])
data[np.isnan(data['Measles'])]
| Country | Year | Status | Life_Expectancy | Adult_Mortality | Infant_deaths | Alcohol | Percentage_expenditure | Hepatitis_B | Measles | ... | Polio | Total_expenditure | Diphtheria | HIV/AIDS | GDP | Population | thinness_1-19_years | thinness_5-9_years | Income_composition_of_resources | Schooling |
|---|
0 rows × 22 columns
data[np.isnan(data['BMI'])]
| | Country | Year | Status | Life_Expectancy | Adult_Mortality | Infant_deaths | Alcohol | Percentage_expenditure | Hepatitis_B | Measles | ... | Polio | Total_expenditure | Diphtheria | HIV/AIDS | GDP | Population | thinness_1-19_years | thinness_5-9_years | Income_composition_of_resources | Schooling |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2458 | Sudan | 2014 | Developing | 63.8 | 229.0 | 59 | 0.01 | 253.608651 | 94.0 | 676 | ... | 94.0 | 8.43 | 94.0 | 0.3 | 2176.898290 | 37737913.0 | NaN | NaN | 0.485 | 7.2 |
| 2459 | Sudan | 2013 | Developing | 63.5 | 232.0 | 60 | 0.01 | 227.835321 | 93.0 | 2813 | ... | 93.0 | 8.42 | 93.0 | 0.3 | 1955.667990 | 36849918.0 | NaN | NaN | 0.478 | 7.0 |
| 2460 | Sudan | 2012 | Developing | 63.2 | 235.0 | 61 | 0.01 | 220.522192 | 92.0 | 8523 | ... | 92.0 | 8.20 | 92.0 | 0.3 | 1892.894352 | 3599192.0 | NaN | NaN | 0.468 | 6.8 |
| 2461 | Sudan | 2011 | Developing | 62.7 | 241.0 | 61 | 2.12 | 196.689215 | 93.0 | 5616 | ... | 93.0 | 8.30 | 93.0 | 0.3 | 1666.857757 | 35167314.0 | NaN | NaN | 0.463 | 7.0 |
| 2462 | Sudan | 2010 | Developing | 62.5 | 243.0 | 62 | 1.77 | 172.009788 | 75.0 | 680 | ... | 9.0 | 7.97 | 9.0 | 0.3 | 1476.478870 | 34385963.0 | NaN | NaN | 0.461 | 7.0 |
| 2463 | Sudan | 2009 | Developing | 62.0 | 248.0 | 63 | 1.99 | 17.053693 | 72.0 | 68 | ... | 81.0 | 8.40 | 81.0 | 0.3 | 1226.884381 | 3365619.0 | NaN | NaN | 0.456 | 6.8 |
| 2464 | Sudan | 2008 | Developing | 61.8 | 251.0 | 64 | 2.01 | 128.636271 | 78.0 | 129 | ... | 85.0 | 8.17 | 86.0 | 0.3 | 1291.528826 | 32955496.0 | NaN | NaN | 0.444 | 6.3 |
| 2465 | Sudan | 2007 | Developing | 61.4 | 254.0 | 65 | 2.01 | 86.131669 | 78.0 | 327 | ... | 84.0 | 4.72 | 84.0 | 0.3 | 1115.695200 | 32282526.0 | NaN | NaN | 0.440 | 6.4 |
| 2466 | Sudan | 2006 | Developing | 61.0 | 26.0 | 66 | 1.90 | 60.336857 | 6.0 | 228 | ... | 77.0 | 3.93 | 78.0 | 0.2 | 893.879364 | 316764.0 | NaN | NaN | 0.430 | 6.2 |
| 2467 | Sudan | 2005 | Developing | 67.0 | 261.0 | 66 | 1.55 | 37.590396 | 22.0 | 1374 | ... | 78.0 | 3.18 | 78.0 | 0.2 | 679.753995 | 3911914.0 | NaN | NaN | 0.423 | 6.1 |
| 2468 | Sudan | 2004 | Developing | 59.7 | 278.0 | 68 | 1.59 | 37.044800 | 22.0 | 9562 | ... | 74.0 | 3.39 | 74.0 | 0.2 | 565.569459 | 3186341.0 | NaN | NaN | 0.415 | 5.7 |
| 2469 | Sudan | 2003 | Developing | 59.6 | 278.0 | 69 | 1.74 | 35.352647 | 22.0 | 4381 | ... | 69.0 | 3.18 | 69.0 | 0.2 | 477.738478 | 29435944.0 | NaN | NaN | 0.409 | 5.6 |
| 2470 | Sudan | 2002 | Developing | 59.4 | 277.0 | 70 | 1.59 | 30.622875 | 22.0 | 4529 | ... | 6.0 | 2.95 | 6.0 | 0.2 | 412.151756 | 28679565.0 | NaN | NaN | 0.403 | 5.6 |
| 2471 | Sudan | 2001 | Developing | 58.9 | 283.0 | 71 | 1.81 | 28.880697 | 22.0 | 4362 | ... | 66.0 | 2.96 | 66.0 | 0.2 | 377.525445 | 279455.0 | NaN | NaN | 0.399 | 5.6 |
| 2472 | Sudan | 2000 | Developing | 58.6 | 284.0 | 71 | 1.76 | 30.860010 | 22.0 | 2875 | ... | 62.0 | 3.23 | 62.0 | 0.1 | 361.358430 | 2725535.0 | NaN | NaN | 0.394 | 5.5 |
15 rows × 22 columns
data['BMI'] = data['BMI'].replace('-', np.nan)
data = data.dropna(axis=0, subset=['BMI'])
data[np.isnan(data['under-five-deaths'])]
| Country | Year | Status | Life_Expectancy | Adult_Mortality | Infant_deaths | Alcohol | Percentage_expenditure | Hepatitis_B | Measles | ... | Polio | Total_expenditure | Diphtheria | HIV/AIDS | GDP | Population | thinness_1-19_years | thinness_5-9_years | Income_composition_of_resources | Schooling |
|---|
0 rows × 22 columns
data[np.isnan(data['Polio'])]
| Country | Year | Status | Life_Expectancy | Adult_Mortality | Infant_deaths | Alcohol | Percentage_expenditure | Hepatitis_B | Measles | ... | Polio | Total_expenditure | Diphtheria | HIV/AIDS | GDP | Population | thinness_1-19_years | thinness_5-9_years | Income_composition_of_resources | Schooling |
|---|
0 rows × 22 columns
data[np.isnan(data['Total_expenditure'])]
| | Country | Year | Status | Life_Expectancy | Adult_Mortality | Infant_deaths | Alcohol | Percentage_expenditure | Hepatitis_B | Measles | ... | Polio | Total_expenditure | Diphtheria | HIV/AIDS | GDP | Population | thinness_1-19_years | thinness_5-9_years | Income_composition_of_resources | Schooling |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 706 | Democratic People's Republic of Korea | 2014 | Developing | 73.0 | 142.0 | 6 | 0.01 | 0.0 | 93.0 | 3 | ... | 99.0 | NaN | 93.0 | 0.1 | NaN | NaN | 4.9 | 4.9 | NaN | NaN |
| 707 | Democratic People's Republic of Korea | 2013 | Developing | 71.0 | 146.0 | 6 | 3.35 | 0.0 | 93.0 | 0 | ... | 99.0 | NaN | 93.0 | 0.1 | NaN | NaN | 5.0 | 5.0 | NaN | NaN |
| 708 | Democratic People's Republic of Korea | 2012 | Developing | 69.8 | 149.0 | 7 | 3.61 | 0.0 | 96.0 | 0 | ... | 99.0 | NaN | 96.0 | 0.1 | NaN | NaN | 5.1 | 5.1 | NaN | NaN |
| 709 | Democratic People's Republic of Korea | 2011 | Developing | 69.4 | 153.0 | 8 | 3.39 | 0.0 | 94.0 | 0 | ... | 99.0 | NaN | 94.0 | 0.1 | NaN | NaN | 5.1 | 5.2 | NaN | NaN |
| 710 | Democratic People's Republic of Korea | 2010 | Developing | 69.0 | 157.0 | 8 | 3.12 | 0.0 | 93.0 | 0 | ... | 99.0 | NaN | 93.0 | 0.1 | NaN | NaN | 5.2 | 5.2 | NaN | NaN |
| 711 | Democratic People's Republic of Korea | 2009 | Developing | 68.7 | 161.0 | 9 | 3.35 | 0.0 | 93.0 | 0 | ... | 98.0 | NaN | 93.0 | 0.1 | NaN | NaN | 5.3 | 5.3 | NaN | NaN |
| 712 | Democratic People's Republic of Korea | 2008 | Developing | 68.6 | 164.0 | 9 | 3.16 | 0.0 | 92.0 | 8 | ... | 98.0 | NaN | 92.0 | 0.1 | NaN | NaN | 5.4 | 5.4 | NaN | NaN |
| 713 | Democratic People's Republic of Korea | 2007 | Developing | 68.5 | 166.0 | 9 | 3.13 | 0.0 | 92.0 | 3550 | ... | 99.0 | NaN | 92.0 | 0.1 | NaN | NaN | 5.5 | 5.5 | NaN | NaN |
| 714 | Democratic People's Republic of Korea | 2006 | Developing | 68.5 | 165.0 | 10 | 3.28 | 0.0 | 96.0 | 0 | ... | 98.0 | NaN | 89.0 | 0.1 | NaN | NaN | 5.6 | 5.6 | NaN | NaN |
| 715 | Democratic People's Republic of Korea | 2005 | Developing | 68.5 | 165.0 | 10 | 3.21 | 0.0 | 92.0 | 0 | ... | 97.0 | NaN | 79.0 | 0.1 | NaN | NaN | 5.7 | 5.7 | NaN | NaN |
| 716 | Democratic People's Republic of Korea | 2004 | Developing | 68.4 | 165.0 | 11 | 3.13 | 0.0 | 98.0 | 0 | ... | 99.0 | NaN | 72.0 | 0.1 | NaN | NaN | 5.7 | 5.7 | NaN | NaN |
| 717 | Democratic People's Republic of Korea | 2003 | Developing | 68.1 | 165.0 | 12 | 3.13 | 0.0 | 27.0 | 0 | ... | 99.0 | NaN | 68.0 | 0.1 | NaN | NaN | 5.8 | 5.8 | NaN | NaN |
| 718 | Democratic People's Republic of Korea | 2002 | Developing | 67.6 | 167.0 | 14 | 3.08 | 0.0 | 27.0 | 0 | ... | 99.0 | NaN | 64.0 | 0.1 | NaN | NaN | 5.9 | 5.9 | NaN | NaN |
| 719 | Democratic People's Republic of Korea | 2001 | Developing | 66.6 | 177.0 | 16 | 2.53 | 0.0 | 27.0 | 0 | ... | 98.0 | NaN | 62.0 | 0.1 | NaN | NaN | 5.9 | 6.0 | NaN | NaN |
| 720 | Democratic People's Republic of Korea | 2000 | Developing | 65.4 | 192.0 | 18 | 3.52 | 0.0 | 27.0 | 0 | ... | 93.0 | NaN | 56.0 | 0.1 | NaN | NaN | 6.0 | 6.0 | NaN | NaN |
| 1845 | New Zealand | 2015 | Developed | 81.6 | 66.0 | 0 | 8.70 | 0.0 | 92.0 | 10 | ... | 92.0 | NaN | 92.0 | 0.1 | 3821.893700 | NaN | 0.4 | 0.3 | 0.913 | 19.2 |
| 2313 | Singapore | 2015 | Developed | 83.1 | 55.0 | 0 | 1.79 | 0.0 | 96.0 | 0 | ... | 96.0 | NaN | 96.0 | 0.1 | 53629.737460 | NaN | 2.2 | 2.2 | 0.924 | 15.4 |
| 2378 | Somalia | 2014 | Developing | 54.3 | 321.0 | 51 | 0.01 | 0.0 | 42.0 | 10229 | ... | 47.0 | NaN | 42.0 | 0.8 | 417.891430 | NaN | 6.7 | 6.5 | NaN | NaN |
| 2379 | Somalia | 2013 | Developing | 54.2 | 318.0 | 51 | 0.01 | 0.0 | 42.0 | 3173 | ... | 47.0 | NaN | 42.0 | 0.8 | 47.543235 | NaN | 6.8 | 6.6 | NaN | NaN |
| 2380 | Somalia | 2012 | Developing | 53.1 | 336.0 | 51 | 0.01 | 0.0 | 42.0 | 9983 | ... | 47.0 | NaN | 42.0 | 0.8 | 47.543235 | NaN | 6.8 | 6.7 | NaN | NaN |
| 2381 | Somalia | 2011 | Developing | 53.1 | 329.0 | 51 | 0.01 | 0.0 | 42.0 | 17298 | ... | 49.0 | NaN | 41.0 | 0.8 | 47.543235 | NaN | 6.9 | 6.7 | NaN | NaN |
| 2382 | Somalia | 2010 | Developing | 52.4 | 336.0 | 52 | 0.01 | 0.0 | 42.0 | 115 | ... | 49.0 | NaN | 45.0 | 0.8 | 47.543235 | NaN | 7.0 | 6.8 | NaN | NaN |
| 2383 | Somalia | 2009 | Developing | 52.2 | 335.0 | 52 | 0.01 | 0.0 | 42.0 | 13 | ... | 41.0 | NaN | 42.0 | 0.8 | 47.543235 | NaN | 7.1 | 6.9 | NaN | NaN |
| 2384 | Somalia | 2008 | Developing | 51.9 | 336.0 | 52 | 0.01 | 0.0 | 42.0 | 1081 | ... | 4.0 | NaN | 31.0 | 0.9 | 47.543235 | NaN | 7.2 | 7.0 | NaN | NaN |
| 2385 | Somalia | 2007 | Developing | 51.5 | 34.0 | 52 | 0.01 | 0.0 | 42.0 | 1149 | ... | 4.0 | NaN | 4.0 | 0.9 | 47.543235 | NaN | 7.3 | 7.1 | NaN | NaN |
| 2386 | Somalia | 2006 | Developing | 51.5 | 337.0 | 51 | 0.01 | 0.0 | 42.0 | 7 | ... | 26.0 | NaN | 26.0 | 0.9 | 47.543235 | NaN | 7.4 | 7.2 | NaN | NaN |
| 2387 | Somalia | 2005 | Developing | 51.6 | 334.0 | 50 | 0.01 | 0.0 | 42.0 | 0 | ... | 35.0 | NaN | 35.0 | 0.9 | 47.543235 | NaN | 7.5 | 7.3 | NaN | NaN |
| 2388 | Somalia | 2004 | Developing | 51.2 | 341.0 | 49 | 0.01 | 0.0 | 42.0 | 12008 | ... | 3.0 | NaN | 3.0 | 0.9 | 47.543235 | NaN | 7.6 | 7.4 | NaN | NaN |
| 2389 | Somalia | 2003 | Developing | 51.1 | 344.0 | 48 | 0.01 | 0.0 | 42.0 | 8257 | ... | 4.0 | NaN | 4.0 | 0.9 | 47.543235 | NaN | 7.7 | 7.5 | NaN | NaN |
| 2390 | Somalia | 2002 | Developing | 58.0 | 348.0 | 47 | 0.01 | 0.0 | 42.0 | 9559 | ... | 4.0 | NaN | 4.0 | 0.9 | 47.543235 | NaN | 7.8 | 7.6 | NaN | NaN |
| 2391 | Somalia | 2001 | Developing | 57.0 | 352.0 | 46 | 0.01 | 0.0 | 42.0 | 3571 | ... | 33.0 | NaN | 33.0 | 0.8 | 47.543235 | NaN | 7.9 | 7.7 | NaN | NaN |
| 2392 | Somalia | 2000 | Developing | 55.0 | 355.0 | 45 | 0.01 | 0.0 | 42.0 | 3965 | ... | 37.0 | NaN | 33.0 | 0.8 | 47.543235 | NaN | 8.0 | 7.9 | NaN | NaN |
32 rows × 22 columns
data['Total_expenditure'] = data['Total_expenditure'].replace('-', np.nan)
data = data.dropna(axis=0, subset=['Total_expenditure'])
data[np.isnan(data['Diphtheria'])]
| Country | Year | Status | Life_Expectancy | Adult_Mortality | Infant_deaths | Alcohol | Percentage_expenditure | Hepatitis_B | Measles | ... | Polio | Total_expenditure | Diphtheria | HIV/AIDS | GDP | Population | thinness_1-19_years | thinness_5-9_years | Income_composition_of_resources | Schooling |
|---|
0 rows × 22 columns
data[np.isnan(data['HIV/AIDS'])]
| Country | Year | Status | Life_Expectancy | Adult_Mortality | Infant_deaths | Alcohol | Percentage_expenditure | Hepatitis_B | Measles | ... | Polio | Total_expenditure | Diphtheria | HIV/AIDS | GDP | Population | thinness_1-19_years | thinness_5-9_years | Income_composition_of_resources | Schooling |
|---|
0 rows × 22 columns
data[np.isnan(data['GDP'])]
| | Country | Year | Status | Life_Expectancy | Adult_Mortality | Infant_deaths | Alcohol | Percentage_expenditure | Hepatitis_B | Measles | ... | Polio | Total_expenditure | Diphtheria | HIV/AIDS | GDP | Population | thinness_1-19_years | thinness_5-9_years | Income_composition_of_resources | Schooling |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 161 | Bahamas | 2014 | Developing | 75.4 | 16.0 | 0 | 9.45 | 0.0 | 96.0 | 0 | ... | 96.0 | 7.74 | 96.0 | 0.1 | NaN | NaN | 2.5 | 2.5 | 0.789 | 12.6 |
| 162 | Bahamas | 2013 | Developing | 74.8 | 172.0 | 0 | 9.42 | 0.0 | 97.0 | 0 | ... | 97.0 | 7.50 | 97.0 | 0.1 | NaN | NaN | 2.5 | 2.5 | 0.790 | 12.6 |
| 163 | Bahamas | 2012 | Developing | 74.9 | 167.0 | 0 | 9.50 | 0.0 | 96.0 | 0 | ... | 99.0 | 7.43 | 98.0 | 0.2 | NaN | NaN | 2.5 | 2.5 | 0.789 | 12.6 |
| 164 | Bahamas | 2011 | Developing | 75.0 | 162.0 | 0 | 9.34 | 0.0 | 95.0 | 0 | ... | 97.0 | 7.63 | 98.0 | 0.1 | NaN | NaN | 2.5 | 2.5 | 0.788 | 12.6 |
| 165 | Bahamas | 2010 | Developing | 75.0 | 161.0 | 0 | 9.19 | 0.0 | 98.0 | 0 | ... | 97.0 | 7.44 | 99.0 | 0.2 | NaN | NaN | 2.5 | 2.5 | 0.788 | 12.6 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2901 | Yemen | 2004 | Developing | 62.2 | 247.0 | 42 | 0.06 | 0.0 | 43.0 | 12708 | ... | 72.0 | 4.90 | 72.0 | 0.1 | NaN | NaN | 13.9 | 13.9 | 0.464 | 8.4 |
| 2902 | Yemen | 2003 | Developing | 61.9 | 249.0 | 43 | 0.04 | 0.0 | 38.0 | 8536 | ... | 61.0 | 5.00 | 61.0 | 0.1 | NaN | NaN | 14.0 | 13.9 | 0.457 | 8.2 |
| 2903 | Yemen | 2002 | Developing | 61.5 | 25.0 | 45 | 0.07 | 0.0 | 31.0 | 890 | ... | 64.0 | 4.22 | 65.0 | 0.1 | NaN | NaN | 14.0 | 14.0 | 0.450 | 8.0 |
| 2904 | Yemen | 2001 | Developing | 61.1 | 251.0 | 46 | 0.08 | 0.0 | 19.0 | 485 | ... | 73.0 | 4.34 | 73.0 | 0.1 | NaN | NaN | 14.0 | 14.0 | 0.444 | 7.9 |
| 2905 | Yemen | 2000 | Developing | 68.0 | 252.0 | 48 | 0.07 | 0.0 | 14.0 | 0 | ... | 74.0 | 4.14 | 74.0 | 0.1 | NaN | NaN | 14.1 | 14.1 | 0.436 | 7.7 |
358 rows × 22 columns
data['GDP'] = data['GDP'].replace('-', np.nan)
data = data.dropna(axis=0, subset=['GDP'])
data[np.isnan(data['Population'])]
| | Country | Year | Status | Life_Expectancy | Adult_Mortality | Infant_deaths | Alcohol | Percentage_expenditure | Hepatitis_B | Measles | ... | Polio | Total_expenditure | Diphtheria | HIV/AIDS | GDP | Population | thinness_1-19_years | thinness_5-9_years | Income_composition_of_resources | Schooling |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 65 | Antigua and Barbuda | 2014 | Developing | 76.2 | 131.0 | 0 | 8.56 | 2422.999774 | 99.0 | 0 | ... | 96.0 | 5.54 | 99.0 | 0.2 | 12888.29667 | NaN | 3.3 | 3.3 | 0.782 | 13.9 |
| 66 | Antigua and Barbuda | 2013 | Developing | 76.1 | 133.0 | 0 | 8.58 | 1991.430372 | 99.0 | 0 | ... | 98.0 | 5.33 | 99.0 | 0.2 | 12224.86416 | NaN | 3.3 | 3.3 | 0.781 | 13.9 |
| 67 | Antigua and Barbuda | 2012 | Developing | 75.9 | 134.0 | 0 | 8.18 | 2156.229842 | 98.0 | 0 | ... | 97.0 | 5.39 | 98.0 | 0.2 | 12565.44197 | NaN | 3.3 | 3.3 | 0.778 | 13.8 |
| 68 | Antigua and Barbuda | 2011 | Developing | 75.7 | 136.0 | 0 | 7.84 | 1810.875316 | 99.0 | 0 | ... | 99.0 | 5.65 | 99.0 | 0.1 | 11929.34991 | NaN | 3.3 | 3.3 | 0.782 | 14.1 |
| 69 | Antigua and Barbuda | 2010 | Developing | 75.6 | 138.0 | 0 | 7.84 | 1983.956937 | 98.0 | 0 | ... | 99.0 | 5.63 | 98.0 | 0.1 | 12126.87614 | NaN | 3.3 | 3.3 | 0.783 | 14.1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2757 | United Arab Emirates | 2004 | Developing | 75.1 | 95.0 | 1 | 1.77 | 2972.448675 | 92.0 | 22 | ... | 94.0 | 2.46 | 94.0 | 0.1 | 36161.17610 | NaN | 5.2 | 4.9 | 0.813 | 12.4 |
| 2758 | United Arab Emirates | 2003 | Developing | 74.9 | 98.0 | 1 | 1.74 | 277.181833 | 92.0 | 42 | ... | 94.0 | 2.65 | 94.0 | 0.1 | 3323.52318 | NaN | 5.2 | 5.0 | 0.808 | 12.3 |
| 2759 | United Arab Emirates | 2002 | Developing | 74.7 | 11.0 | 1 | 1.72 | 2598.842827 | 92.0 | 53 | ... | 94.0 | 2.72 | 94.0 | 0.1 | 31311.35936 | NaN | 5.3 | 5.0 | 0.803 | 12.1 |
| 2760 | United Arab Emirates | 2001 | Developing | 74.5 | 14.0 | 1 | 1.67 | 243.753913 | 92.0 | 30 | ... | 94.0 | 2.48 | 94.0 | 0.1 | 3161.52935 | NaN | 5.3 | 5.1 | 0.798 | 12.0 |
| 2761 | United Arab Emirates | 2000 | Developing | 74.2 | 17.0 | 1 | 1.64 | 262.958958 | 92.0 | 69 | ... | 94.0 | 2.38 | 94.0 | 0.1 | 3371.26869 | NaN | 5.4 | 5.1 | 0.791 | 11.8 |
207 rows × 22 columns
data['Population'] = data['Population'].replace('-', np.nan)
data = data.dropna(axis=0, subset=['Population'])
data[np.isnan(data['thinness_1-19_years'])]
| Country | Year | Status | Life_Expectancy | Adult_Mortality | Infant_deaths | Alcohol | Percentage_expenditure | Hepatitis_B | Measles | ... | Polio | Total_expenditure | Diphtheria | HIV/AIDS | GDP | Population | thinness_1-19_years | thinness_5-9_years | Income_composition_of_resources | Schooling |
|---|
0 rows × 22 columns
data['thinness_1-19_years'] = data['thinness_1-19_years'].replace('-', np.nan)
data = data.dropna(axis=0, subset=['thinness_1-19_years'])
data[np.isnan(data['thinness_5-9_years'])]
| Country | Year | Status | Life_Expectancy | Adult_Mortality | Infant_deaths | Alcohol | Percentage_expenditure | Hepatitis_B | Measles | ... | Polio | Total_expenditure | Diphtheria | HIV/AIDS | GDP | Population | thinness_1-19_years | thinness_5-9_years | Income_composition_of_resources | Schooling |
|---|
0 rows × 22 columns
data['thinness_5-9_years'] = data['thinness_5-9_years'].replace('-', np.nan)
data = data.dropna(axis=0, subset=['thinness_5-9_years'])
data[np.isnan(data['Income_composition_of_resources'])]
| Country | Year | Status | Life_Expectancy | Adult_Mortality | Infant_deaths | Alcohol | Percentage_expenditure | Hepatitis_B | Measles | ... | Polio | Total_expenditure | Diphtheria | HIV/AIDS | GDP | Population | thinness_1-19_years | thinness_5-9_years | Income_composition_of_resources | Schooling |
|---|
0 rows × 22 columns
data['Income_composition_of_resources'] = data['Income_composition_of_resources'].replace('-', np.nan)
data = data.dropna(axis=0, subset=['Income_composition_of_resources'])
data[np.isnan(data['Schooling'])]
| Country | Year | Status | Life_Expectancy | Adult_Mortality | Infant_deaths | Alcohol | Percentage_expenditure | Hepatitis_B | Measles | ... | Polio | Total_expenditure | Diphtheria | HIV/AIDS | GDP | Population | thinness_1-19_years | thinness_5-9_years | Income_composition_of_resources | Schooling |
|---|
0 rows × 22 columns
data['Schooling'] = data['Schooling'].replace('-', np.nan)
data = data.dropna(axis=0, subset=['Schooling'])
data.isnull().sum()
Country                            0
Year                               0
Status                             0
Life_Expectancy                    0
Adult_Mortality                    0
Infant_deaths                      0
Alcohol                            0
Percentage_expenditure             0
Hepatitis_B                        0
Measles                            0
BMI                                0
under-five-deaths                  0
Polio                              0
Total_expenditure                  0
Diphtheria                         0
HIV/AIDS                           0
GDP                                0
Population                         0
thinness_1-19_years                0
thinness_5-9_years                 0
Income_composition_of_resources    0
Schooling                          0
dtype: int64
Above, we checked each column of our dataframe for null or missing values and then dropped every row that had a missing value in any of the checked columns.
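The column-by-column check and drop can also be written in a single pass. The following is only a minimal sketch on a small made-up dataframe: the `toy` frame and its values are hypothetical and are not taken from the WHO data, and it assumes that missing entries are already stored as NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the life-expectancy dataframe.
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Alcohol": [1.2, np.nan, 3.4, 0.5],
    "GDP": [500.0, 700.0, np.nan, 900.0],
})

# Count the missing values in every column at once.
missing_per_column = toy.isna().sum()

# Drop each row that has a missing value in any of the listed columns.
cleaned = toy.dropna(axis=0, subset=["Alcohol", "GDP"])

print(missing_per_column)
print(cleaned)
```

Listing all the columns in one `subset` gives the same final rows as dropping column by column, because a row is removed as soon as any listed column is missing.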
# Map each numeric column to its subplot position.
col_dict = {'Life_Expectancy': 1, 'Adult_Mortality': 2, 'Infant_deaths': 3, 'Alcohol': 4, 'Percentage_expenditure': 5,
            'Hepatitis_B': 6, 'Measles': 7, 'BMI': 8, 'under-five-deaths': 9, 'Polio': 10, 'Total_expenditure': 11,
            'Diphtheria': 12, 'HIV/AIDS': 13, 'thinness_1-19_years': 14, 'thinness_5-9_years': 15, 'GDP': 16}
# Detect outliers in each variable using box plots.
plt.figure(figsize=(20,30))
for variable, i in col_dict.items():
    plt.subplot(7, 4, i)
    plt.boxplot(data[variable])
    plt.title(variable)
plt.show()
After this round of data cleaning, in which we dropped every row containing null or NaN values in the respective columns, our dataframe has no missing data. The box plots above are therefore drawn from complete data and show the outliers in each variable.
data.boxplot(by='Status',column=['Life_Expectancy'], figsize=(5,4))
<AxesSubplot: title={'center': 'Life_Expectancy'}, xlabel='Status'>
We have constructed a box plot of Life Expectancy grouped by Status, that is, for Developing and Developed countries. We can see that life expectancy in developed countries is clearly higher than in developing countries. The outliers appear in the developing group, which indicates that life expectancy varies more widely among developing countries; we treat these outliers in a later step because they can distort the analysis.
data['Life_Expectancy'].plot(kind='box', title='BOX PLOT: Life Expectancies before removing outliers',figsize=(7,4))
plt.show()
We have constructed a box plot of life expectancy for the whole dataframe, and the outliers are visible in the plot.
data.plot.scatter(x='Status', y='Life_Expectancy', c = 'red', s = 10,colormap='viridis',figsize=(10,10))
plt.title("Scatter plot: Life Expectancy in relation to status")
plt.xlabel("STATUS")
plt.ylabel("LIFE EXPECTANCY")
plt.show()
data.plot(x='Status', y='Life_Expectancy',figsize=(10,10), kind= 'hist', color='violet', edgecolor='blue')
plt.title("Histogram: Life Expectancy in relation to status")
plt.xlabel("LIFE EXPECTANCY")
plt.ylabel("FREQUENCY")
plt.show()
Scatter Plot and Histogram:
data.plot.scatter(x='Year', y='Life_Expectancy', c = 'green', s = 10,colormap='viridis',figsize=(10,10))
plt.title("Scatter plot: Life Expectancy in relation to year")
plt.xlabel("YEAR")
plt.ylabel("LIFE EXPECTANCY")
plt.show()
Scatter Plot:
data.plot.scatter(x='Country', y='Life_Expectancy', c = 'blue', s = 10,colormap='viridis',figsize=(8,8))
plt.title("Scatter plot: Life Expectancy in relation to country")
plt.xlabel("COUNTRY")
plt.ylabel("LIFE EXPECTANCY")
plt.show()
data.plot.scatter(x='Population', y='Life_Expectancy', c = 'brown', s = 10,colormap='viridis',figsize=(8,8))
plt.title("Scatter plot: Life Expectancy in relation to population")
plt.xlabel("POPULATION")
plt.ylabel("LIFE EXPECTANCY")
plt.show()
data.plot.scatter(x='BMI', y='Life_Expectancy', c = 'purple', s = 10,colormap='viridis',figsize=(7,8))
plt.title("Scatter plot: Life Expectancy in relation to BMI")
plt.xlabel("BMI")
plt.ylabel("LIFE EXPECTANCY")
plt.show()
Above Plots:
def remove_outlier(df):
    # Cap Life_Expectancy at the IQR bounds (modifies df in place; no rows are dropped).
    IQR = df['Life_Expectancy'].quantile(0.75) - df['Life_Expectancy'].quantile(0.25)
    lower_range = df['Life_Expectancy'].quantile(0.25) - (1.5 * IQR)
    upper_range = df['Life_Expectancy'].quantile(0.75) + (1.5 * IQR)
    # Replace values beyond either bound with the bound itself.
    df.loc[df['Life_Expectancy'] <= lower_range, 'Life_Expectancy'] = lower_range
    df.loc[df['Life_Expectancy'] >= upper_range, 'Life_Expectancy'] = upper_range
remove_outlier(data)
There are several ways to handle outliers. Here we use the interquartile range (IQR) to treat the outliers in LIFE EXPECTANCY: any value below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR is replaced by that bound, so the rows are kept and only the extreme values are capped. Since the IQR is simply the range of the middle 50% of the data values, it is not affected by extreme outliers.
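To make the IQR rule concrete, here is a minimal sketch on a handful of made-up values (the `toy` series is hypothetical, not WHO data). It contrasts capping values at the bounds, which is what `remove_outlier` does, with the alternative of dropping the outlying rows altogether.

```python
import pandas as pd

# Hypothetical life-expectancy values with one low and one high outlier.
toy = pd.Series([40.0, 60.0, 62.0, 64.0, 66.0, 68.0, 70.0, 95.0])

# Quartiles and the 1.5 * IQR bounds.
q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Option 1: cap values at the bounds (keeps every row).
capped = toy.clip(lower=lower, upper=upper)

# Option 2: drop the rows that fall outside the bounds.
dropped = toy[(toy >= lower) & (toy <= upper)]

print(lower, upper)
print(capped.tolist())
print(dropped.tolist())
```

Capping keeps the sample size unchanged, which matters when the other columns of the same rows are still needed; dropping gives a smaller but unmodified sample.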
data.boxplot(by='Status',column=['Life_Expectancy'])
<AxesSubplot: title={'center': 'Life_Expectancy'}, xlabel='Status'>
data['Life_Expectancy'].plot(kind='box', title='BOX PLOT: Life Expectancies after removing outliers',figsize=(4,3))
plt.show()
data.boxplot(by='Year',column=['Life_Expectancy'], figsize=(7,7))
<AxesSubplot: title={'center': 'Life_Expectancy'}, xlabel='Year'>
After removing the outliers:
data.isnull().sum()*100/len(data)
Country                            0.0
Year                               0.0
Status                             0.0
Life_Expectancy                    0.0
Adult_Mortality                    0.0
Infant_deaths                      0.0
Alcohol                            0.0
Percentage_expenditure             0.0
Hepatitis_B                        0.0
Measles                            0.0
BMI                                0.0
under-five-deaths                  0.0
Polio                              0.0
Total_expenditure                  0.0
Diphtheria                         0.0
HIV/AIDS                           0.0
GDP                                0.0
Population                         0.0
thinness_1-19_years                0.0
thinness_5-9_years                 0.0
Income_composition_of_resources    0.0
Schooling                          0.0
dtype: float64
Therefore, when we check the percentage of missing values in each column of our dataframe, we get zero (0) for all of them. After dropping the rows with missing (null or NaN) values and treating the outliers, the data is clean and ready for further analysis using machine learning and data science techniques.
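As a small illustration of the percentage check, the sketch below uses a made-up frame (the `toy` frame is hypothetical). It shows that taking the mean of the boolean missing-value mask gives the same percentages as dividing the counts by the number of rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value in one column.
toy = pd.DataFrame({
    "Alcohol": [1.2, np.nan, 3.4, 0.5],
    "GDP": [500.0, 700.0, 650.0, 900.0],
})

# Percentage of missing values per column, written two equivalent ways.
pct_missing = toy.isna().mean() * 100
pct_missing_long_form = toy.isnull().sum() * 100 / len(toy)

print(pct_missing)
```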
developing = []
developed = []
# Collect the rows whose Status is "Developing".
for index, row in data_copy.iterrows():
    if row['Status'] == "Developing":
        developing.append(row)
developing = pd.DataFrame(developing)
Here we created a new dataframe, named developing, that contains only the rows of our original dataframe whose Status is Developing.
# Collect the rows whose Status is "Developed".
for index, row in data_copy.iterrows():
    if row['Status'] == "Developed":
        developed.append(row)
developed = pd.DataFrame(developed)
Here we created a new dataframe, named developed, that contains only the rows of our original dataframe whose Status is Developed.
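The same split can be sketched with boolean masks, which avoids the explicit loop over rows. This is a minimal sketch on a made-up frame: `toy`, `developing_toy` and `developed_toy` are hypothetical names, and it assumes the Status column holds exactly the strings "Developing" and "Developed".

```python
import pandas as pd

# Hypothetical toy frame standing in for data_copy.
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Status": ["Developing", "Developed", "Developing", "Developed"],
    "Life_Expectancy": [61.0, 80.5, 67.2, 82.1],
})

# Keep only the rows whose Status matches each group.
developing_toy = toy[toy["Status"] == "Developing"]
developed_toy = toy[toy["Status"] == "Developed"]

print(developing_toy)
print(developed_toy)
```

A mask keeps the original column dtypes and row index, and on large frames it is usually much faster than appending rows one at a time.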
developing.describe()
| | Year | Life_Expectancy | Adult_Mortality | Infant_deaths | Alcohol | Percentage_expenditure | Hepatitis_B | Measles | BMI | under-five-deaths | Polio | Total_expenditure | Diphtheria | HIV/AIDS | GDP | Population | thinness_1-19_years | thinness_5-9_years | Income_composition_of_resources | Schooling |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2426.000000 | 2416.000000 | 2416.000000 | 2426.000000 | 2260.000000 | 2426.000000 | 2046.000000 | 2426.000000 | 2392.000000 | 2426.000000 | 2407.000000 | 2232.000000 | 2407.000000 | 2426.000000 | 2042.000000 | 1.870000e+03 | 2392.000000 | 2392.000000 | 2307.000000 | 2311.000000 |
| mean | 2007.522671 | 67.111465 | 182.833195 | 36.384171 | 3.484119 | 323.470285 | 79.763930 | 2824.926216 | 35.435326 | 50.525144 | 80.170752 | 5.590694 | 79.951807 | 2.088664 | 4286.556053 | 1.407108e+07 | 5.592935 | 5.635242 | 0.582310 | 11.219256 |
| std | 4.614690 | 9.006092 | 127.974557 | 128.942509 | 3.347537 | 846.655356 | 25.564884 | 12528.811419 | 19.425091 | 175.379909 | 24.671531 | 2.233756 | 24.834300 | 5.526145 | 8772.467789 | 6.702886e+07 | 4.514453 | 4.606130 | 0.201597 | 3.056601 |
| min | 2000.000000 | 36.300000 | 1.000000 | 0.000000 | 0.010000 | 0.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 3.000000 | 0.370000 | 2.000000 | 0.100000 | 1.681350 | 3.400000e+01 | 0.100000 | 0.100000 | 0.000000 | 0.000000 |
| 25% | 2004.000000 | 61.100000 | 92.000000 | 1.000000 | 0.517500 | 3.616102 | 75.000000 | 0.000000 | 18.300000 | 1.000000 | 74.000000 | 4.140000 | 75.000000 | 0.100000 | 382.749830 | 1.938082e+05 | 2.100000 | 2.100000 | 0.466500 | 9.600000 |
| 50% | 2008.000000 | 69.000000 | 163.000000 | 6.000000 | 2.560000 | 48.431829 | 91.000000 | 18.000000 | 35.200000 | 7.000000 | 91.000000 | 5.400000 | 91.000000 | 0.100000 | 1246.021671 | 1.404827e+06 | 4.500000 | 4.600000 | 0.631000 | 11.700000 |
| 75% | 2012.000000 | 74.000000 | 253.000000 | 28.000000 | 5.750000 | 257.702204 | 97.000000 | 514.500000 | 53.200000 | 39.000000 | 97.000000 | 6.830000 | 96.500000 | 1.400000 | 4147.739877 | 7.735673e+06 | 7.725000 | 7.800000 | 0.727000 | 13.200000 |
| max | 2015.000000 | 89.000000 | 723.000000 | 1800.000000 | 17.870000 | 9748.636237 | 99.000000 | 212183.000000 | 87.300000 | 2500.000000 | 99.000000 | 17.240000 | 99.000000 | 50.600000 | 88564.822980 | 1.293859e+09 | 27.700000 | 28.600000 | 0.919000 | 18.300000 |
We display the summary statistics (count, mean, standard deviation, minimum, 25th percentile, median, 75th percentile and maximum) of every numeric column for developing countries.
developed.describe()
| | Year | Life_Expectancy | Adult_Mortality | Infant_deaths | Alcohol | Percentage_expenditure | Hepatitis_B | Measles | BMI | under-five-deaths | Polio | Total_expenditure | Diphtheria | HIV/AIDS | GDP | Population | thinness_1-19_years | thinness_5-9_years | Income_composition_of_resources | Schooling |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 512.000000 | 512.000000 | 512.000000 | 512.000000 | 484.000000 | 512.000000 | 339.000000 | 512.000000 | 512.000000 | 512.000000 | 512.000000 | 480.000000 | 512.000000 | 5.120000e+02 | 448.000000 | 4.160000e+02 | 512.000000 | 512.000000 | 464.000000 | 464.000000 |
| mean | 2007.500000 | 79.197852 | 79.685547 | 1.494141 | 9.826736 | 2703.600380 | 88.041298 | 499.005859 | 51.803906 | 1.810547 | 93.736328 | 7.554042 | 93.476562 | 1.000000e-01 | 22053.386446 | 6.830053e+06 | 1.320703 | 1.296680 | 0.852489 | 15.845474 |
| std | 4.614281 | 3.930942 | 47.877583 | 4.585774 | 2.765858 | 3824.200588 | 20.489240 | 2529.084588 | 17.196829 | 5.384006 | 10.783713 | 2.984389 | 12.531113 | 1.389136e-17 | 22870.827763 | 1.479524e+07 | 0.756577 | 0.829099 | 0.052843 | 1.766799 |
| min | 2000.000000 | 69.900000 | 1.000000 | 0.000000 | 0.010000 | 0.000000 | 2.000000 | 0.000000 | 3.200000 | 0.000000 | 9.000000 | 1.100000 | 9.000000 | 1.000000e-01 | 12.277330 | 1.230000e+02 | 0.300000 | 0.200000 | 0.703000 | 11.500000 |
| 25% | 2003.750000 | 76.800000 | 58.000000 | 0.000000 | 8.617500 | 92.904052 | 89.000000 | 0.000000 | 53.775000 | 0.000000 | 93.000000 | 6.407500 | 93.750000 | 1.000000e-01 | 3875.740910 | 1.993282e+05 | 0.700000 | 0.600000 | 0.815000 | 14.700000 |
| 50% | 2007.500000 | 79.250000 | 73.000000 | 0.000000 | 10.320000 | 846.615644 | 95.000000 | 12.000000 | 57.450000 | 0.000000 | 96.000000 | 7.895000 | 96.000000 | 1.000000e-01 | 13560.723860 | 1.167660e+06 | 1.100000 | 1.000000 | 0.862000 | 15.800000 |
| 75% | 2011.250000 | 81.700000 | 96.000000 | 1.000000 | 11.697500 | 4102.863046 | 97.000000 | 96.500000 | 61.300000 | 2.000000 | 98.000000 | 9.212500 | 98.000000 | 1.000000e-01 | 36760.425993 | 5.759450e+06 | 1.900000 | 1.900000 | 0.894000 | 16.800000 |
| max | 2015.000000 | 89.000000 | 229.000000 | 28.000000 | 15.190000 | 19479.911610 | 99.000000 | 33812.000000 | 69.600000 | 33.000000 | 99.000000 | 17.600000 | 99.000000 | 1.000000e-01 | 119172.741800 | 8.253418e+07 | 4.000000 | 4.300000 | 0.948000 | 20.700000 |
We display summary statistics (number of rows, mean, standard deviation, minimum, 25th percentile, median, 75th percentile and maximum) for all numeric columns of the developed countries.
developed.head()
| | Country | Year | Status | Life_Expectancy | Adult_Mortality | Infant_deaths | Alcohol | Percentage_expenditure | Hepatitis_B | Measles | ... | Polio | Total_expenditure | Diphtheria | HIV/AIDS | GDP | Population | thinness_1-19_years | thinness_5-9_years | Income_composition_of_resources | Schooling |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 112 | Australia | 2015 | Developed | 82.8 | 59.0 | 1 | NaN | 0.00000 | 93.0 | 74 | ... | 93.0 | NaN | 93.0 | 0.1 | 56554.38760 | 23789338.0 | 0.6 | 0.6 | 0.937 | 20.4 |
| 113 | Australia | 2014 | Developed | 82.7 | 6.0 | 1 | 9.71 | 10769.36305 | 91.0 | 340 | ... | 92.0 | 9.42 | 92.0 | 0.1 | 62214.69120 | 2346694.0 | 0.6 | 0.6 | 0.936 | 20.4 |
| 114 | Australia | 2013 | Developed | 82.5 | 61.0 | 1 | 9.87 | 11734.85381 | 91.0 | 158 | ... | 91.0 | 9.36 | 91.0 | 0.1 | 67792.33860 | 23117353.0 | 0.6 | 0.6 | 0.933 | 20.3 |
| 115 | Australia | 2012 | Developed | 82.3 | 61.0 | 1 | 10.03 | 11714.99858 | 91.0 | 199 | ... | 92.0 | 9.36 | 92.0 | 0.1 | 67677.63477 | 22728254.0 | 0.6 | 0.6 | 0.930 | 20.1 |
| 116 | Australia | 2011 | Developed | 82.0 | 63.0 | 1 | 10.30 | 10986.26527 | 92.0 | 190 | ... | 92.0 | 9.20 | 92.0 | 0.1 | 62245.12900 | 223424.0 | 0.6 | 0.6 | 0.927 | 19.8 |
5 rows × 22 columns
developing.head()
| | Country | Year | Status | Life_Expectancy | Adult_Mortality | Infant_deaths | Alcohol | Percentage_expenditure | Hepatitis_B | Measles | ... | Polio | Total_expenditure | Diphtheria | HIV/AIDS | GDP | Population | thinness_1-19_years | thinness_5-9_years | Income_composition_of_resources | Schooling |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 2015 | Developing | 65.0 | 263.0 | 62 | 0.01 | 71.279624 | 65.0 | 1154 | ... | 6.0 | 8.16 | 65.0 | 0.1 | 584.259210 | 33736494.0 | 17.2 | 17.3 | 0.479 | 10.1 |
| 1 | Afghanistan | 2014 | Developing | 59.9 | 271.0 | 64 | 0.01 | 73.523582 | 62.0 | 492 | ... | 58.0 | 8.18 | 62.0 | 0.1 | 612.696514 | 327582.0 | 17.5 | 17.5 | 0.476 | 10.0 |
| 2 | Afghanistan | 2013 | Developing | 59.9 | 268.0 | 66 | 0.01 | 73.219243 | 64.0 | 430 | ... | 62.0 | 8.13 | 64.0 | 0.1 | 631.744976 | 31731688.0 | 17.7 | 17.7 | 0.470 | 9.9 |
| 3 | Afghanistan | 2012 | Developing | 59.5 | 272.0 | 69 | 0.01 | 78.184215 | 67.0 | 2787 | ... | 67.0 | 8.52 | 67.0 | 0.1 | 669.959000 | 3696958.0 | 17.9 | 18.0 | 0.463 | 9.8 |
| 4 | Afghanistan | 2011 | Developing | 59.2 | 275.0 | 71 | 0.01 | 7.097109 | 68.0 | 3013 | ... | 68.0 | 7.87 | 68.0 | 0.1 | 63.537231 | 2978599.0 | 18.2 | 18.2 | 0.454 | 9.5 |
5 rows × 22 columns
developing.isnull().sum()*100/len(developing)
Country                             0.000000
Year                                0.000000
Status                              0.000000
Life_Expectancy                     0.412201
Adult_Mortality                     0.412201
Infant_deaths                       0.000000
Alcohol                             6.842539
Percentage_expenditure              0.000000
Hepatitis_B                        15.663644
Measles                             0.000000
BMI                                 1.401484
under-five-deaths                   0.000000
Polio                               0.783182
Total_expenditure                   7.996702
Diphtheria                          0.783182
HIV/AIDS                            0.000000
GDP                                15.828524
Population                         22.918384
thinness_1-19_years                 1.401484
thinness_5-9_years                  1.401484
Income_composition_of_resources     4.905194
Schooling                           4.740313
dtype: float64
We compute the percentage of missing (null) values in each column of the dataframe we created for developing countries. Cleaning the data up front helps us avoid errors and unreliable results when we later apply machine learning and data science algorithms.
developing.loc[(developing.Hepatitis_B.isna()), 'Hepatitis_B'] = developing.Hepatitis_B.mean()
developing.loc[(developing.GDP.isna()), 'GDP'] = developing.GDP.mean()
developing.loc[(developing.Population.isna()), 'Population'] = developing.Population.mean()
developing.loc[(developing.Alcohol.isna()), 'Alcohol'] = developing.Alcohol.mean()
developing.loc[(developing.BMI.isna()), 'BMI'] = developing.BMI.mean()
developing.loc[(developing['thinness_1-19_years'].isna()), 'thinness_1-19_years'] = developing['thinness_1-19_years'].mean()
developing.loc[(developing['thinness_5-9_years'].isna()), 'thinness_5-9_years'] = developing['thinness_5-9_years'].mean()
developing.loc[(developing.Income_composition_of_resources.isna()), 'Income_composition_of_resources'] = developing.Income_composition_of_resources.mean()
developing.loc[(developing.Schooling.isna()), 'Schooling'] = developing.Schooling.mean()
developing.loc[(developing.Total_expenditure.isna()), 'Total_expenditure'] = developing.Total_expenditure.mean()
developing.isnull().sum()*100/len(developing)
Country                            0.000000
Year                               0.000000
Status                             0.000000
Life_Expectancy                    0.412201
Adult_Mortality                    0.412201
Infant_deaths                      0.000000
Alcohol                            0.000000
Percentage_expenditure             0.000000
Hepatitis_B                        0.000000
Measles                            0.000000
BMI                                0.000000
under-five-deaths                  0.000000
Polio                              0.783182
Total_expenditure                  0.000000
Diphtheria                         0.783182
HIV/AIDS                           0.000000
GDP                                0.000000
Population                         0.000000
thinness_1-19_years                0.000000
thinness_5-9_years                 0.000000
Income_composition_of_resources    0.000000
Schooling                          0.000000
dtype: float64
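Previously we dropped rows with null values; here we apply another technique for handling missing (NaN) values, which is to replace them with the mean of the respective column. Life_Expectancy, Adult_Mortality, Polio and Diphtheria were not imputed, so they still contain a small share of missing values.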
developed.isnull().sum()*100/len(developed)
Country                             0.000000
Year                                0.000000
Status                              0.000000
Life_Expectancy                     0.000000
Adult_Mortality                     0.000000
Infant_deaths                       0.000000
Alcohol                             5.468750
Percentage_expenditure              0.000000
Hepatitis_B                        33.789062
Measles                             0.000000
BMI                                 0.000000
under-five-deaths                   0.000000
Polio                               0.000000
Total_expenditure                   6.250000
Diphtheria                          0.000000
HIV/AIDS                            0.000000
GDP                                12.500000
Population                         18.750000
thinness_1-19_years                 0.000000
thinness_5-9_years                  0.000000
Income_composition_of_resources     9.375000
Schooling                           9.375000
dtype: float64
We compute the percentage of missing (null) values in each column of the dataframe we created for developed countries. Cleaning the data up front helps us avoid errors and unreliable results when we later apply machine learning and data science algorithms.
developed.loc[(developed.Hepatitis_B.isna()), 'Hepatitis_B'] = developed.Hepatitis_B.mean()
developed.loc[(developed.GDP.isna()), 'GDP'] = developed.GDP.mean()
developed.loc[(developed.Population.isna()), 'Population'] = developed.Population.mean()
developed.loc[(developed.Alcohol.isna()), 'Alcohol'] = developed.Alcohol.mean()
developed.loc[(developed.BMI.isna()), 'BMI'] = developed.BMI.mean()
developed.loc[(developed.Income_composition_of_resources.isna()), 'Income_composition_of_resources'] = developed.Income_composition_of_resources.mean()
developed.loc[(developed.Schooling.isna()), 'Schooling'] = developed.Schooling.mean()
developed.loc[(developed.Total_expenditure.isna()), 'Total_expenditure'] = developed.Total_expenditure.mean()
developed.isnull().sum()*100/len(developed)
Country                            0.0
Year                               0.0
Status                             0.0
Life_Expectancy                    0.0
Adult_Mortality                    0.0
Infant_deaths                      0.0
Alcohol                            0.0
Percentage_expenditure             0.0
Hepatitis_B                        0.0
Measles                            0.0
BMI                                0.0
under-five-deaths                  0.0
Polio                              0.0
Total_expenditure                  0.0
Diphtheria                         0.0
HIV/AIDS                           0.0
GDP                                0.0
Population                         0.0
thinness_1-19_years                0.0
thinness_5-9_years                 0.0
Income_composition_of_resources    0.0
Schooling                          0.0
dtype: float64
Previously we dropped rows with null values; here we apply another technique for handling missing (NaN) values, which is to replace them with the mean of the respective column. We could also use other methods, such as replacing them with the median or using interpolate().
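The imputation options mentioned above can be compared side by side on a small frame. This is only a sketch: the `toy` dataframe and its values are made up for illustration and are not taken from the WHO data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real data (values are made up).
toy = pd.DataFrame({
    "Alcohol": [1.0, np.nan, 3.0, 8.0],
    "GDP": [100.0, 200.0, np.nan, 400.0],
})

# Mean imputation for every numeric column in one call.
mean_filled = toy.fillna(toy.mean(numeric_only=True))

# Median imputation, which is less sensitive to extreme values.
median_filled = toy.fillna(toy.median(numeric_only=True))

# Linear interpolation between neighbouring rows (only sensible for ordered data).
interpolated = toy.interpolate()

print(mean_filled)
print(median_filled)
print(interpolated)
```

Passing a Series of column means to `fillna` fills each column with its own mean, which is a more compact form of the one-line-per-column `.loc` assignments used in this notebook.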
developed.plot(x='Year', y='Life_Expectancy',color = 'pink', kind = 'hist', figsize=(9,9), edgecolor = "black")
# A histogram bins the y column, so the horizontal axis shows life expectancy
plt.xlabel("Life Expectancy")
plt.ylabel("Frequency")
plt.title("Histogram: Year and Life Expectancy for Developed Countries")
developing.plot(x='Year', y='Life_Expectancy', color = 'orange', kind = 'hist', figsize=(9,9),edgecolor = "black")
# A histogram bins the y column, so the horizontal axis shows life expectancy
plt.xlabel("Life Expectancy")
plt.ylabel("Frequency")
plt.title("Histogram: Year and Life Expectancy for Developing Countries")
Text(0.5, 1.0, 'Histogram: Year and Life Expectancy for Developing Countries')
Above Plots:
plt.figure(figsize=(10,7))
sns.violinplot(x="Year",y="Life_Expectancy",data=developed)
plt.show()
plt.figure(figsize=(10,8))
sns.violinplot(x="Year",y="Life_Expectancy",data=developing)
plt.show()
Above Plots: The violin plots show life expectancy by year for developed and developing countries, drawn after replacing missing values with the column mean. Together with the summary statistics, they suggest that developed countries have a higher life expectancy than developing countries, which is consistent with their stronger economic situation.
plt.figure(figsize=(17,7))
sns.barplot(x="Status",y="Life_Expectancy",hue="Year",data=data)
plt.title("Life expectancy over time by Status", size = 15)
plt.xlabel("Status")
plt.ylabel("Life Expectancy")
plt.show()
Above Plots:
data.corr(numeric_only=True)
| | Year | Life_Expectancy | Adult_Mortality | Infant_deaths | Alcohol | Percentage_expenditure | Hepatitis_B | Measles | BMI | under-five-deaths | Polio | Total_expenditure | Diphtheria | HIV/AIDS | GDP | Population | thinness_1-19_years | thinness_5-9_years | Income_composition_of_resources | Schooling |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Year | 1.000000 | 0.171593 | -0.072108 | -0.037601 | -0.046859 | 0.089096 | 0.247259 | -0.099554 | 0.096059 | -0.042479 | 0.117642 | 0.074139 | 0.166006 | -0.142581 | 0.119355 | 0.022775 | -0.047477 | -0.053483 | 0.242900 | 0.225046 |
| Life_Expectancy | 0.171593 | 1.000000 | -0.659853 | -0.161230 | 0.393336 | 0.414151 | 0.249875 | -0.138435 | 0.600424 | -0.187739 | 0.415326 | 0.200832 | 0.443123 | -0.577381 | 0.444095 | -0.010911 | -0.459704 | -0.451058 | 0.727357 | 0.745245 |
| Adult_Mortality | -0.072108 | -0.659853 | 1.000000 | 0.038304 | -0.181469 | -0.242438 | -0.103382 | -0.007269 | -0.372519 | 0.052865 | -0.208006 | -0.096727 | -0.210136 | 0.536273 | -0.256955 | -0.022403 | 0.278842 | 0.284581 | -0.411010 | -0.404160 |
| Infant_deaths | -0.037601 | -0.161230 | 0.038304 | 1.000000 | -0.104406 | -0.089772 | -0.216949 | 0.509747 | -0.227769 | 0.996729 | -0.152153 | -0.147961 | -0.156470 | 0.001739 | -0.097720 | 0.562805 | 0.481580 | 0.487596 | -0.137175 | -0.195815 |
| Alcohol | -0.046859 | 0.393336 | -0.181469 | -0.104406 | 1.000000 | 0.430835 | 0.106383 | -0.029252 | 0.379327 | -0.099713 | 0.239854 | 0.227108 | 0.245454 | -0.033756 | 0.458112 | -0.025556 | -0.398731 | -0.387087 | 0.532173 | 0.590261 |
| Percentage_expenditure | 0.089096 | 0.414151 | -0.242438 | -0.089772 | 0.430835 | 1.000000 | -0.011530 | -0.069316 | 0.277788 | -0.092480 | 0.162606 | 0.217103 | 0.168910 | -0.109680 | 0.963177 | -0.016607 | -0.268347 | -0.268131 | 0.408174 | 0.426801 |
| Hepatitis_B | 0.247259 | 0.249875 | -0.103382 | -0.216949 | 0.106383 | -0.011530 | 1.000000 | -0.142059 | 0.198627 | -0.226512 | 0.451299 | 0.130435 | 0.552732 | -0.104034 | 0.009077 | -0.092418 | -0.166903 | -0.181161 | 0.239386 | 0.268951 |
| Measles | -0.099554 | -0.138435 | -0.007269 | 0.509747 | -0.029252 | -0.069316 | -0.142059 | 1.000000 | -0.168172 | 0.519173 | -0.113574 | -0.111638 | -0.119828 | 0.019600 | -0.073858 | 0.269840 | 0.227961 | 0.223325 | -0.137648 | -0.146625 |
| BMI | 0.096059 | 0.600424 | -0.372519 | -0.227769 | 0.379327 | 0.277788 | 0.198627 | -0.168172 | 1.000000 | -0.238155 | 0.264753 | 0.233643 | 0.266601 | -0.239171 | 0.300517 | -0.071306 | -0.560982 | -0.567943 | 0.539862 | 0.584998 |
| under-five-deaths | -0.042479 | -0.187739 | 0.052865 | 0.996729 | -0.099713 | -0.092480 | -0.226512 | 0.519173 | -0.238155 | 1.000000 | -0.169989 | -0.148325 | -0.177302 | 0.013390 | -0.101404 | 0.548761 | 0.483954 | 0.488759 | -0.157039 | -0.212678 |
| Polio | 0.117642 | 0.415326 | -0.208006 | -0.152153 | 0.239854 | 0.162606 | 0.451299 | -0.113574 | 0.264753 | -0.169989 | 1.000000 | 0.153724 | 0.680436 | -0.132156 | 0.188618 | -0.026926 | -0.204346 | -0.205516 | 0.395265 | 0.437793 |
| Total_expenditure | 0.074139 | 0.200832 | -0.096727 | -0.147961 | 0.227108 | 0.217103 | 0.130435 | -0.111638 | 0.233643 | -0.148325 | 0.153724 | 1.000000 | 0.166398 | 0.025389 | 0.211375 | -0.078289 | -0.233957 | -0.249369 | 0.194385 | 0.256340 |
| Diphtheria | 0.166006 | 0.443123 | -0.210136 | -0.156470 | 0.245454 | 0.168910 | 0.552732 | -0.119828 | 0.266601 | -0.177302 | 0.680436 | 0.166398 | 1.000000 | -0.142735 | 0.190969 | -0.022151 | -0.223845 | -0.214652 | 0.436006 | 0.458794 |
| HIV/AIDS | -0.142581 | -0.577381 | 0.536273 | 0.001739 | -0.033756 | -0.109680 | -0.104034 | 0.019600 | -0.239171 | 0.013390 | -0.132156 | 0.025389 | -0.142735 | 1.000000 | -0.121314 | -0.032315 | 0.193041 | 0.195961 | -0.242581 | -0.209940 |
| GDP | 0.119355 | 0.444095 | -0.256955 | -0.097720 | 0.458112 | 0.963177 | 0.009077 | -0.073858 | 0.300517 | -0.101404 | 0.188618 | 0.211375 | 0.190969 | -0.121314 | 1.000000 | -0.019751 | -0.288131 | -0.287875 | 0.450799 | 0.468299 |
| Population | 0.022775 | -0.010911 | -0.022403 | 0.562805 | -0.025556 | -0.016607 | -0.092418 | 0.269840 | -0.071306 | 0.548761 | -0.026926 | -0.078289 | -0.022151 | -0.032315 | -0.019751 | 1.000000 | 0.259960 | 0.257564 | 0.001611 | -0.024767 |
| thinness_1-19_years | -0.047477 | -0.459704 | 0.278842 | 0.481580 | -0.398731 | -0.268347 | -0.166903 | 0.227961 | -0.560982 | 0.483954 | -0.204346 | -0.233957 | -0.223845 | 0.193041 | -0.288131 | 0.259960 | 1.000000 | 0.926872 | -0.432073 | -0.464851 |
| thinness_5-9_years | -0.053483 | -0.451058 | 0.284581 | 0.487596 | -0.387087 | -0.268131 | -0.181161 | 0.223325 | -0.567943 | 0.488759 | -0.205516 | -0.249369 | -0.214652 | 0.195961 | -0.287875 | 0.257564 | 0.926872 | 1.000000 | -0.414205 | -0.448979 |
| Income_composition_of_resources | 0.242900 | 0.727357 | -0.411010 | -0.137175 | 0.532173 | 0.408174 | 0.239386 | -0.137648 | 0.539862 | -0.157039 | 0.395265 | 0.194385 | 0.436006 | -0.242581 | 0.450799 | 0.001611 | -0.432073 | -0.414205 | 1.000000 | 0.806551 |
| Schooling | 0.225046 | 0.745245 | -0.404160 | -0.195815 | 0.590261 | 0.426801 | 0.268951 | -0.146625 | 0.584998 | -0.212678 | 0.437793 | 0.256340 | 0.458794 | -0.209940 | 0.468299 | -0.024767 | -0.464851 | -0.448979 | 0.806551 | 1.000000 |
The corr() call above returns the pairwise correlations between all the numerical columns in our dataframe.
plt.figure(figsize=(15,15))
correlation = data.corr(numeric_only=True)
sns.heatmap(correlation, annot = True, square = True, linewidths=.5)
plt.title("Correlation Matrix")
plt.show()
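We know that a correlation coefficient takes values from -1 through +1. Values near +1 indicate a strong positive linear relationship, values near -1 a strong negative one, and values near 0 little or no linear relationship. In the matrix above, Schooling (0.75) and Income_composition_of_resources (0.73) have the strongest positive correlation with Life_Expectancy, while Adult_Mortality (-0.66) has the strongest negative correlation.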
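To make the coefficient concrete, the sketch below computes a Pearson correlation by hand and compares it with pandas. The `toy` values are invented for illustration; only the column names are borrowed from our dataframe.

```python
import numpy as np
import pandas as pd

# Hypothetical values, chosen only to illustrate the formula.
toy = pd.DataFrame({
    "Schooling": [8.0, 10.0, 12.0, 14.0, 16.0],
    "Life_Expectancy": [55.0, 62.0, 66.0, 74.0, 80.0],
})

x = toy["Schooling"].to_numpy()
y = toy["Life_Expectancy"].to_numpy()

# Pearson r: sum of co-deviations divided by the square root of the
# product of the two sums of squared deviations.
r_manual = ((x - x.mean()) * (y - y.mean())).sum() / np.sqrt(
    ((x - x.mean()) ** 2).sum() * ((y - y.mean()) ** 2).sum()
)

# Series.corr uses the Pearson method by default.
r_pandas = toy["Schooling"].corr(toy["Life_Expectancy"])
print(r_manual, r_pandas)
```

Both numbers should agree up to floating-point rounding, and each entry of the correlation matrix is this quantity for one pair of columns.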
Feature Engineering: Of the several feature engineering techniques available, we applied Label Encoding, One-Hot Encoding and Binning.
Label Encoding: It converts categorical data into numerical data by assigning an integer code to each category.
We created a copy of our original dataframe and converted all categorical columns into numerical form.
def Encoder(data_copy):
    # Label-encode every categorical/object column in place
    columnsToEncode = list(data_copy.select_dtypes(include=['category','object']))
    le = LabelEncoder()
    for feature in columnsToEncode:
        try:
            data_copy[feature] = le.fit_transform(data_copy[feature])
        except Exception:
            print('Error encoding '+feature)
    return data_copy
Encoder(data_copy)
| | Country | Year | Status | Life_Expectancy | Adult_Mortality | Infant_deaths | Alcohol | Percentage_expenditure | Hepatitis_B | Measles | ... | Polio | Total_expenditure | Diphtheria | HIV/AIDS | GDP | Population | thinness_1-19_years | thinness_5-9_years | Income_composition_of_resources | Schooling |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2015 | 1 | 65.0 | 263.0 | 62 | 0.01 | 71.279624 | 65.0 | 1154 | ... | 6.0 | 8.16 | 65.0 | 0.1 | 584.259210 | 33736494.0 | 17.2 | 17.3 | 0.479 | 10.1 |
| 1 | 0 | 2014 | 1 | 59.9 | 271.0 | 64 | 0.01 | 73.523582 | 62.0 | 492 | ... | 58.0 | 8.18 | 62.0 | 0.1 | 612.696514 | 327582.0 | 17.5 | 17.5 | 0.476 | 10.0 |
| 2 | 0 | 2013 | 1 | 59.9 | 268.0 | 66 | 0.01 | 73.219243 | 64.0 | 430 | ... | 62.0 | 8.13 | 64.0 | 0.1 | 631.744976 | 31731688.0 | 17.7 | 17.7 | 0.470 | 9.9 |
| 3 | 0 | 2012 | 1 | 59.5 | 272.0 | 69 | 0.01 | 78.184215 | 67.0 | 2787 | ... | 67.0 | 8.52 | 67.0 | 0.1 | 669.959000 | 3696958.0 | 17.9 | 18.0 | 0.463 | 9.8 |
| 4 | 0 | 2011 | 1 | 59.2 | 275.0 | 71 | 0.01 | 7.097109 | 68.0 | 3013 | ... | 68.0 | 7.87 | 68.0 | 0.1 | 63.537231 | 2978599.0 | 18.2 | 18.2 | 0.454 | 9.5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2933 | 192 | 2004 | 1 | 44.3 | 723.0 | 27 | 4.36 | 0.000000 | 68.0 | 31 | ... | 67.0 | 7.13 | 65.0 | 33.6 | 454.366654 | 12777511.0 | 9.4 | 9.4 | 0.407 | 9.2 |
| 2934 | 192 | 2003 | 1 | 44.5 | 715.0 | 26 | 4.06 | 0.000000 | 7.0 | 998 | ... | 7.0 | 6.52 | 68.0 | 36.7 | 453.351155 | 12633897.0 | 9.8 | 9.9 | 0.418 | 9.5 |
| 2935 | 192 | 2002 | 1 | 44.8 | 73.0 | 25 | 4.43 | 0.000000 | 73.0 | 304 | ... | 73.0 | 6.53 | 71.0 | 39.8 | 57.348340 | 125525.0 | 1.2 | 1.3 | 0.427 | 10.0 |
| 2936 | 192 | 2001 | 1 | 45.3 | 686.0 | 25 | 1.72 | 0.000000 | 76.0 | 529 | ... | 76.0 | 6.16 | 75.0 | 42.1 | 548.587312 | 12366165.0 | 1.6 | 1.7 | 0.427 | 9.8 |
| 2937 | 192 | 2000 | 1 | 46.0 | 665.0 | 24 | 1.68 | 0.000000 | 79.0 | 1483 | ... | 78.0 | 7.10 | 78.0 | 43.5 | 547.358878 | 12222251.0 | 11.0 | 11.2 | 0.434 | 9.8 |
2938 rows × 22 columns
One-Hot Encoding: This technique spreads the values of a categorical feature across multiple flag features, one per category, and assigns each flag a value of 0 or 1.
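A minimal sketch of the idea on a tiny frame follows. The `toy` dataframe is hypothetical, and encoding only the `Status` column is an assumption made for clarity rather than what the notebook cells do.

```python
import pandas as pd

# Hypothetical three-row frame with one categorical column.
toy = pd.DataFrame({
    "Status": ["Developing", "Developed", "Developing"],
    "Life_Expectancy": [65.0, 82.8, 59.9],
})

# Only the categorical column is expanded into 0/1 flag columns;
# the numeric column passes through unchanged.
encoded = pd.get_dummies(toy, columns=["Status"], dtype=int)
print(encoded)
```

Each category becomes its own column (`Status_Developed`, `Status_Developing`), and exactly one of the flags is 1 in every row.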
OneHotEncoder().fit_transform(data_copy).toarray()
array([[1., 0., 0., ..., 0., 0., 0.],
[1., 0., 0., ..., 0., 0., 0.],
[1., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]])
pd.get_dummies(data_copy, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)
| | Country | Year | Status | Life_Expectancy | Adult_Mortality | Infant_deaths | Alcohol | Percentage_expenditure | Hepatitis_B | Measles | ... | Polio | Total_expenditure | Diphtheria | HIV/AIDS | GDP | Population | thinness_1-19_years | thinness_5-9_years | Income_composition_of_resources | Schooling |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2015 | 1 | 65.0 | 263.0 | 62 | 0.01 | 71.279624 | 65.0 | 1154 | ... | 6.0 | 8.16 | 65.0 | 0.1 | 584.259210 | 33736494.0 | 17.2 | 17.3 | 0.479 | 10.1 |
| 1 | 0 | 2014 | 1 | 59.9 | 271.0 | 64 | 0.01 | 73.523582 | 62.0 | 492 | ... | 58.0 | 8.18 | 62.0 | 0.1 | 612.696514 | 327582.0 | 17.5 | 17.5 | 0.476 | 10.0 |
| 2 | 0 | 2013 | 1 | 59.9 | 268.0 | 66 | 0.01 | 73.219243 | 64.0 | 430 | ... | 62.0 | 8.13 | 64.0 | 0.1 | 631.744976 | 31731688.0 | 17.7 | 17.7 | 0.470 | 9.9 |
| 3 | 0 | 2012 | 1 | 59.5 | 272.0 | 69 | 0.01 | 78.184215 | 67.0 | 2787 | ... | 67.0 | 8.52 | 67.0 | 0.1 | 669.959000 | 3696958.0 | 17.9 | 18.0 | 0.463 | 9.8 |
| 4 | 0 | 2011 | 1 | 59.2 | 275.0 | 71 | 0.01 | 7.097109 | 68.0 | 3013 | ... | 68.0 | 7.87 | 68.0 | 0.1 | 63.537231 | 2978599.0 | 18.2 | 18.2 | 0.454 | 9.5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2933 | 192 | 2004 | 1 | 44.3 | 723.0 | 27 | 4.36 | 0.000000 | 68.0 | 31 | ... | 67.0 | 7.13 | 65.0 | 33.6 | 454.366654 | 12777511.0 | 9.4 | 9.4 | 0.407 | 9.2 |
| 2934 | 192 | 2003 | 1 | 44.5 | 715.0 | 26 | 4.06 | 0.000000 | 7.0 | 998 | ... | 7.0 | 6.52 | 68.0 | 36.7 | 453.351155 | 12633897.0 | 9.8 | 9.9 | 0.418 | 9.5 |
| 2935 | 192 | 2002 | 1 | 44.8 | 73.0 | 25 | 4.43 | 0.000000 | 73.0 | 304 | ... | 73.0 | 6.53 | 71.0 | 39.8 | 57.348340 | 125525.0 | 1.2 | 1.3 | 0.427 | 10.0 |
| 2936 | 192 | 2001 | 1 | 45.3 | 686.0 | 25 | 1.72 | 0.000000 | 76.0 | 529 | ... | 76.0 | 6.16 | 75.0 | 42.1 | 548.587312 | 12366165.0 | 1.6 | 1.7 | 0.427 | 9.8 |
| 2937 | 192 | 2000 | 1 | 46.0 | 665.0 | 24 | 1.68 | 0.000000 | 79.0 | 1483 | ... | 78.0 | 7.10 | 78.0 | 43.5 | 547.358878 | 12222251.0 | 11.0 | 11.2 | 0.434 | 9.8 |
2938 rows × 22 columns
Binning: We bin the data into four periods and compute the mean life expectancy and mean alcohol consumption for each period. The two columns considered here are therefore Life_Expectancy and Alcohol.
data['discretize'] = pd.cut(data.Year, 4, ordered=True, labels=None)
data['label'] = pd.cut(data.Year, 4, ordered=True, labels=['first','second','third','fourth'])
Here we bin the data into four equal-width intervals based on year, labeled first, second, third and fourth.
data.head()
| | Country | Year | Status | Life_Expectancy | Adult_Mortality | Infant_deaths | Alcohol | Percentage_expenditure | Hepatitis_B | Measles | ... | Diphtheria | HIV/AIDS | GDP | Population | thinness_1-19_years | thinness_5-9_years | Income_composition_of_resources | Schooling | discretize | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 2015 | Developing | 65.0 | 263.0 | 62 | 0.01 | 71.279624 | 65.0 | 1154 | ... | 65.0 | 0.1 | 584.259210 | 33736494.0 | 17.2 | 17.3 | 0.479 | 10.1 | (2011.25, 2015.0] | fourth |
| 1 | Afghanistan | 2014 | Developing | 59.9 | 271.0 | 64 | 0.01 | 73.523582 | 62.0 | 492 | ... | 62.0 | 0.1 | 612.696514 | 327582.0 | 17.5 | 17.5 | 0.476 | 10.0 | (2011.25, 2015.0] | fourth |
| 2 | Afghanistan | 2013 | Developing | 59.9 | 268.0 | 66 | 0.01 | 73.219243 | 64.0 | 430 | ... | 64.0 | 0.1 | 631.744976 | 31731688.0 | 17.7 | 17.7 | 0.470 | 9.9 | (2011.25, 2015.0] | fourth |
| 3 | Afghanistan | 2012 | Developing | 59.5 | 272.0 | 69 | 0.01 | 78.184215 | 67.0 | 2787 | ... | 67.0 | 0.1 | 669.959000 | 3696958.0 | 17.9 | 18.0 | 0.463 | 9.8 | (2011.25, 2015.0] | fourth |
| 4 | Afghanistan | 2011 | Developing | 59.2 | 275.0 | 71 | 0.01 | 7.097109 | 68.0 | 3013 | ... | 68.0 | 0.1 | 63.537231 | 2978599.0 | 18.2 | 18.2 | 0.454 | 9.5 | (2007.5, 2011.25] | third |
5 rows × 24 columns
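To see which years fall into which bin, the equal-width binning can be reproduced on the bare range of years. This is a sketch that assumes the data covers 2000 to 2015, as the summary statistics suggest; the `toy` frame is hypothetical.

```python
import pandas as pd

# Assumed range of years in the data set (2000 through 2015).
years = pd.Series(range(2000, 2016), name="Year")

# Four equal-width bins over the observed range, with the same labels as above.
labels = pd.cut(years, 4, labels=["first", "second", "third", "fourth"])

toy = pd.DataFrame({"Year": years, "label": labels})

# First year, last year and number of years in each bin.
counts = toy.groupby("label", observed=False)["Year"].agg(["min", "max", "count"])
print(counts)
```

With this range each bin covers four calendar years, which matches the table above where 2011 is labeled third and 2012 is labeled fourth.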
# Split the data into one dataframe per period using the 'label' column
first = data[data['label'] == 'first'].copy()
second = data[data['label'] == 'second'].copy()
third = data[data['label'] == 'third'].copy()
fourth = data[data['label'] == 'fourth'].copy()
Above:
meansbyCountryfirst = first.groupby('Country')[['Life_Expectancy', 'Alcohol']].mean()
df = pd.DataFrame(meansbyCountryfirst, columns=["Alcohol", "Life_Expectancy"])
ax = df.plot(x="Alcohol", y="Life_Expectancy", kind="scatter", figsize=(20,18), color="red")
plt.xlabel("Mean of Alcohol for first period")
plt.ylabel("Mean of Life Expectancy for first period")
plt.title("Scatterplot showing mean Alcohol (x-axis) vs. mean Life_Expectancy (y-axis) for first period.")
df.reset_index(inplace=True)
# Label each point with its country name
for i, txt in enumerate(df.Country):
    ax.annotate(txt, (df.Alcohol.iat[i]+0.05, df.Life_Expectancy.iat[i]))
plt.grid(True)
plt.show()
meansbyCountrysecond = second.groupby('Country')[['Life_Expectancy', 'Alcohol']].mean()
df = pd.DataFrame(meansbyCountrysecond, columns=["Alcohol", "Life_Expectancy"])
ax = df.plot(x="Alcohol", y="Life_Expectancy", kind="scatter", figsize=(20,18), color="blue")
plt.xlabel("Mean of Alcohol for second period")
plt.ylabel("Mean of Life Expectancy for second period")
plt.title("Scatterplot showing mean Alcohol (x-axis) vs. mean Life_Expectancy (y-axis) for second period.")
df.reset_index(inplace=True)
# Label each point with its country name
for i, txt in enumerate(df.Country):
    ax.annotate(txt, (df.Alcohol.iat[i]+0.05, df.Life_Expectancy.iat[i]))
plt.grid(True)
plt.show()
meansbyCountrythird = third.groupby('Country')[['Life_Expectancy', 'Alcohol']].mean()
# Use the third-period means here (the original cell reused the second-period means)
df = pd.DataFrame(meansbyCountrythird, columns=["Alcohol", "Life_Expectancy"])
ax = df.plot(x="Alcohol", y="Life_Expectancy", kind="scatter", figsize=(20,18), color="green")
plt.xlabel("Mean of Alcohol for third period")
plt.ylabel("Mean of Life Expectancy for third period")
plt.title("Scatterplot showing mean Alcohol (x-axis) vs. mean Life_Expectancy (y-axis) for third period.")
df.reset_index(inplace=True)
# Label each point with its country name
for i, txt in enumerate(df.Country):
    ax.annotate(txt, (df.Alcohol.iat[i]+0.05, df.Life_Expectancy.iat[i]))
plt.grid(True)
plt.show()
meansbyCountryfourth = fourth.groupby('Country')[['Life_Expectancy', 'Alcohol']].mean()
df = pd.DataFrame(meansbyCountryfourth, columns=["Alcohol", "Life_Expectancy"])
ax = df.plot(x="Alcohol", y="Life_Expectancy", kind="scatter", figsize=(20,18), color="brown")
plt.xlabel("Mean of Alcohol for fourth period")
plt.ylabel("Mean of Life Expectancy for fourth period")
plt.title("Scatterplot showing mean Alcohol (x-axis) vs. mean Life_Expectancy (y-axis) for fourth period.")
df.reset_index(inplace=True)
# Label each point with its country name
for i, txt in enumerate(df.Country):
    ax.annotate(txt, (df.Alcohol.iat[i]+0.05, df.Life_Expectancy.iat[i]))
plt.grid(True)
plt.show()
Above Plots:
Steps of Hypothesis Testing:
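The Shapiro-Wilk test lets us assess whether a random sample comes from a normal distribution. Its null hypothesis is that the sample is normally distributed, so a p-value above 0.05 means we do not reject normality.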
data['Life_Expectancy'].hist()
<AxesSubplot: >
DataToTest = data['Life_Expectancy']
stat, p = shapiro(DataToTest)
print('stat=%.2f, p = %.30f' % (stat,p))
if p > 0.05:
    print('Normal Distribution')
else:
    print('Not a Normal Distribution')
stat=0.96, p = 0.000000000000000000001508767845
Not a Normal Distribution
Here we ran the Shapiro-Wilk test on Life_Expectancy, the main column of interest in our dataframe. The p-value is about 1.5e-21, which is far below 0.05, so we conclude that life expectancy is not normally distributed.
# 100 draws from a standard normal distribution, for comparison
DataToTest = randn(100)
stat, p = shapiro(DataToTest)
print('stat=%.2f, p = %.30f' % (stat,p))
if p > 0.05:
    print('Normal Distribution')
else:
    print('Not a Normal Distribution')
stat=0.98, p = 0.111630119383335113525390625000
Normal Distribution
Above:
stats.ttest_1samp(a=data_copy, popmean=310)
Ttest_1sampResult(statistic=array([-2.07103082e+02, 1.99424196e+04, -4.41699317e+04, nan,
nan, -1.28558506e+02, nan, 1.16768796e+01,
nan, 9.97158699e+00, nan, -9.05263686e+01,
nan, nan, nan, -3.29052987e+03,
nan, nan, nan, nan,
nan, nan]), pvalue=array([0.00000000e+00, 0.00000000e+00, 0.00000000e+00, nan,
nan, 0.00000000e+00, nan, 7.94503419e-31,
nan, 4.69949919e-23, nan, 0.00000000e+00,
nan, nan, nan, 0.00000000e+00,
nan, nan, nan, nan,
nan, nan]))
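The `nan` entries in the result above most likely come from columns that still contain missing values, since the test was applied to every column at once. A sketch of a one-sample t-test on a single column with the missing values dropped first is shown below. The `sample` values and the hypothesized mean of 70 are assumptions chosen for illustration only.

```python
import numpy as np
from scipy import stats

# Hypothetical sample with one missing value (values are illustrative).
sample = np.array([65.0, 59.9, 59.9, 59.5, np.nan, 82.8, 82.7, 82.5])
popmean = 70.0  # hypothesized population mean (an assumption)

# Drop missing values so the statistic is not propagated as NaN.
clean = sample[~np.isnan(sample)]

# H0: the population mean equals popmean.
result = stats.ttest_1samp(clean, popmean=popmean)

# The same statistic written out: (sample mean - popmean) / standard error.
t_manual = (clean.mean() - popmean) / (clean.std(ddof=1) / np.sqrt(clean.size))
print(result.statistic, result.pvalue, t_manual)
```

Testing one column at a time also lets each column be compared against a hypothesized mean that is meaningful for its own units.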
from statsmodels.stats import weightstats as stests
ztest ,pval = stests.ztest(data['Life_Expectancy'], x2=None, value=156)
print(float(pval))
if pval<0.05:
    print("reject null hypothesis")
else:
    print("fail to reject null hypothesis")
0.0
reject null hypothesis
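The statistic behind a one-sample z-test can be written out by hand, as in the sketch below. The `sample` values are a handful of life expectancies used purely for illustration, and the hypothesized mean `mu0` of 70 is an assumption, not the value tested above.

```python
import numpy as np
from scipy.stats import norm

# Illustrative sample of life expectancies (not the full column).
sample = np.array([65.0, 59.9, 59.9, 59.5, 59.2, 82.8, 82.7, 82.5, 82.3, 82.0])
mu0 = 70.0  # hypothesized mean (an assumption for this sketch)

# z = (sample mean - hypothesized mean) / standard error of the mean
n = sample.size
z = (sample.mean() - mu0) / (sample.std(ddof=1) / np.sqrt(n))

# Two-sided p-value from the standard normal distribution.
p_value = 2 * norm.sf(abs(z))

decision = "reject null hypothesis" if p_value < 0.05 else "fail to reject null hypothesis"
print(z, p_value, decision)
```

A large p-value only means the data do not contradict the hypothesized mean, which is why the wording "fail to reject" is preferred over "accept".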
We run Z-test only if:
sns.lmplot(x='Year',y='Life_Expectancy',data=data,fit_reg=True, line_kws={"color": "C8"})
<seaborn.axisgrid.FacetGrid at 0x241ded59a50>
Here we fit a linear regression model of Life Expectancy vs. Year.
import statsmodels.formula.api as smf
models = smf.ols(formula='Life_Expectancy ~ Year', data=data).fit()
models.summary()
| Dep. Variable: | Life_Expectancy | R-squared: | 0.029 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.029 |
| Method: | Least Squares | F-statistic: | 60.22 |
| Date: | Sun, 11 Dec 2022 | Prob (F-statistic): | 1.35e-14 |
| Time: | 18:28:15 | Log-Likelihood: | -7302.0 |
| No. Observations: | 1987 | AIC: | 1.461e+04 |
| Df Residuals: | 1985 | BIC: | 1.462e+04 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -703.9591 | 99.476 | -7.077 | 0.000 | -899.048 | -508.870 |
| Year | 0.3846 | 0.050 | 7.760 | 0.000 | 0.287 | 0.482 |
| Omnibus: | 116.097 | Durbin-Watson: | 0.161 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 99.112 |
| Skew: | -0.473 | Prob(JB): | 3.01e-22 |
| Kurtosis: | 2.451 | Cond. No. | 9.32e+05 |
Analysis:
Here we compute the residuals of the linear model (observed life expectancy minus predicted life expectancy) and examine them against Year.
predict = models.predict(data.loc[:,['Year']])
data['newpredict'] = predict
data['Residual'] = data['Life_Expectancy'] - data['newpredict']
data.head()
| | Country | Year | Status | Life_Expectancy | Adult_Mortality | Infant_deaths | Alcohol | Percentage_expenditure | Hepatitis_B | Measles | ... | GDP | Population | thinness_1-19_years | thinness_5-9_years | Income_composition_of_resources | Schooling | discretize | label | newpredict | Residual |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 2015 | Developing | 65.0 | 263.0 | 62 | 0.01 | 71.279624 | 65.0 | 1154 | ... | 584.259210 | 33736494.0 | 17.2 | 17.3 | 0.479 | 10.1 | (2011.25, 2015.0] | fourth | 71.070164 | -6.070164 |
| 1 | Afghanistan | 2014 | Developing | 59.9 | 271.0 | 64 | 0.01 | 73.523582 | 62.0 | 492 | ... | 612.696514 | 327582.0 | 17.5 | 17.5 | 0.476 | 10.0 | (2011.25, 2015.0] | fourth | 70.685534 | -10.785534 |
| 2 | Afghanistan | 2013 | Developing | 59.9 | 268.0 | 66 | 0.01 | 73.219243 | 64.0 | 430 | ... | 631.744976 | 31731688.0 | 17.7 | 17.7 | 0.470 | 9.9 | (2011.25, 2015.0] | fourth | 70.300904 | -10.400904 |
| 3 | Afghanistan | 2012 | Developing | 59.5 | 272.0 | 69 | 0.01 | 78.184215 | 67.0 | 2787 | ... | 669.959000 | 3696958.0 | 17.9 | 18.0 | 0.463 | 9.8 | (2011.25, 2015.0] | fourth | 69.916274 | -10.416274 |
| 4 | Afghanistan | 2011 | Developing | 59.2 | 275.0 | 71 | 0.01 | 7.097109 | 68.0 | 3013 | ... | 63.537231 | 2978599.0 | 18.2 | 18.2 | 0.454 | 9.5 | (2007.5, 2011.25] | third | 69.531644 | -10.331644 |
5 rows × 26 columns
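The residual computation can be checked on a tiny example. The sketch below fits a straight line with NumPy instead of statsmodels, and the four data points are invented for illustration.

```python
import numpy as np

# Hypothetical observations: four years and made-up life expectancies.
year = np.array([2000, 2005, 2010, 2015], dtype=float)
life = np.array([66.0, 68.5, 70.0, 72.5])

# Least-squares straight line: life ≈ intercept + slope * year
slope, intercept = np.polyfit(year, life, deg=1)

# Residual = observed value minus the value predicted by the line.
predicted = intercept + slope * year
residual = life - predicted
print(slope, residual)
```

For a least-squares fit with an intercept the residuals sum to zero, so a residual plot should scatter around the horizontal zero line if the linear model is adequate.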
ax = sns.violinplot(x='Year', y='Residual', data=data)
ax.set_title('Violin Plot using Seaborn via statsmodel')
Text(0.5, 1.0, 'Violin Plot using Seaborn via statsmodel')
sns.lmplot(x='Year',y='Residual',data=data,fit_reg=True, line_kws={"color": "C8"})
<seaborn.axisgrid.FacetGrid at 0x241dfa45060>
Above:
data['Residual'].plot(kind='box', title='BOX PLOT: Residual',figsize=(7,4))
plt.show()
Here we made a box plot of the residuals we computed. It looks good because it shows no outliers.
sns.lmplot(x="Year",y="Life_Expectancy",hue="Status",data=data)
ax = plt.gca()
ax.set_title("Life expectancy versus year by status, with a regression line per status")
Text(0.5, 1.0, 'Life expectancy versus year by status, with a regression line per status')
Above Plot:
sns.lmplot(x="Year",y="Residual",hue="Status",data=data)
ax = plt.gca()
ax.set_title("Residuals versus year by status, with a regression line per status")
Text(0.5, 1.0, 'Residuals versus year by status, with a regression line per status')
Above Plot:
data.columns
Index(['Country', 'Year', 'Status', 'Life_Expectancy', 'Adult_Mortality',
'Infant_deaths', 'Alcohol', 'Percentage_expenditure', 'Hepatitis_B',
'Measles', 'BMI', 'under-five-deaths', 'Polio', 'Total_expenditure',
'Diphtheria', 'HIV/AIDS', 'GDP', 'Population', 'thinness_1-19_years',
'thinness_5-9_years', 'Income_composition_of_resources', 'Schooling',
'discretize', 'label', 'newpredict', 'Residual'],
dtype='object')
def Encoder(data):
    # Label-encode every categorical or object column in place
    columnsToEncode = list(data.select_dtypes(include=['category','object']))
    le = LabelEncoder()
    for feature in columnsToEncode:
        try:
            data[feature] = le.fit_transform(data[feature])
        except Exception:
            print('Error encoding '+feature)
    return data
Encoder(data)
| Country | Year | Status | Life_Expectancy | Adult_Mortality | Infant_deaths | Alcohol | Percentage_expenditure | Hepatitis_B | Measles | ... | GDP | Population | thinness_1-19_years | thinness_5-9_years | Income_composition_of_resources | Schooling | discretize | label | newpredict | Residual | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2015 | 1 | 65.0 | 263.0 | 62 | 0.01 | 71.279624 | 65.0 | 1154 | ... | 584.259210 | 33736494.0 | 17.2 | 17.3 | 0.479 | 10.1 | 3 | 1 | 71.070164 | -6.070164 |
| 1 | 0 | 2014 | 1 | 59.9 | 271.0 | 64 | 0.01 | 73.523582 | 62.0 | 492 | ... | 612.696514 | 327582.0 | 17.5 | 17.5 | 0.476 | 10.0 | 3 | 1 | 70.685534 | -10.785534 |
| 2 | 0 | 2013 | 1 | 59.9 | 268.0 | 66 | 0.01 | 73.219243 | 64.0 | 430 | ... | 631.744976 | 31731688.0 | 17.7 | 17.7 | 0.470 | 9.9 | 3 | 1 | 70.300904 | -10.400904 |
| 3 | 0 | 2012 | 1 | 59.5 | 272.0 | 69 | 0.01 | 78.184215 | 67.0 | 2787 | ... | 669.959000 | 3696958.0 | 17.9 | 18.0 | 0.463 | 9.8 | 3 | 1 | 69.916274 | -10.416274 |
| 4 | 0 | 2011 | 1 | 59.2 | 275.0 | 71 | 0.01 | 7.097109 | 68.0 | 3013 | ... | 63.537231 | 2978599.0 | 18.2 | 18.2 | 0.454 | 9.5 | 2 | 3 | 69.531644 | -10.331644 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2933 | 132 | 2004 | 1 | 44.3 | 723.0 | 27 | 4.36 | 0.000000 | 68.0 | 31 | ... | 454.366654 | 12777511.0 | 9.4 | 9.4 | 0.407 | 9.2 | 1 | 2 | 66.839235 | -22.539235 |
| 2934 | 132 | 2003 | 1 | 44.5 | 715.0 | 26 | 4.06 | 0.000000 | 7.0 | 998 | ... | 453.351155 | 12633897.0 | 9.8 | 9.9 | 0.418 | 9.5 | 0 | 0 | 66.454605 | -21.954605 |
| 2935 | 132 | 2002 | 1 | 44.8 | 73.0 | 25 | 4.43 | 0.000000 | 73.0 | 304 | ... | 57.348340 | 125525.0 | 1.2 | 1.3 | 0.427 | 10.0 | 0 | 0 | 66.069975 | -21.269975 |
| 2936 | 132 | 2001 | 1 | 45.3 | 686.0 | 25 | 1.72 | 0.000000 | 76.0 | 529 | ... | 548.587312 | 12366165.0 | 1.6 | 1.7 | 0.427 | 9.8 | 0 | 0 | 65.685345 | -20.385345 |
| 2937 | 132 | 2000 | 1 | 46.0 | 665.0 | 24 | 1.68 | 0.000000 | 79.0 | 1483 | ... | 547.358878 | 12222251.0 | 11.0 | 11.2 | 0.434 | 9.8 | 0 | 0 | 65.300715 | -19.300715 |
1987 rows × 26 columns
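As a rough illustration of what the encoding step does to a column such as Status, here is a small standard-library sketch. It assumes, as scikit-learn's `LabelEncoder` does, that codes are assigned in sorted order of the distinct labels. The function name `label_encode` is hypothetical.

```python
def label_encode(values):
    """Map each distinct label to an integer code, in sorted label order."""
    classes = sorted(set(values))
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[v] for v in values], classes

status = ["Developing", "Developed", "Developing"]
codes, classes = label_encode(status)

print(classes)  # ['Developed', 'Developing']
print(codes)    # [1, 0, 1]
```

This agrees with the table above, where the Developing rows carry Status 1. Note that the integer codes imply an ordering that has no real meaning for a column like Country, so a linear model's coefficient on such a column should be read with care.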
X = data[['Country','Year', 'Status', 'Adult_Mortality',
'Infant_deaths', 'Alcohol', 'Percentage_expenditure', 'Hepatitis_B',
'Measles', 'BMI', 'under-five-deaths', 'Polio', 'Total_expenditure',
'Diphtheria', 'HIV/AIDS', 'GDP', 'Population', 'thinness_1-19_years',
'thinness_5-9_years', 'Income_composition_of_resources', 'Schooling',
'discretize', 'label', 'newpredict']]
y = data['Life_Expectancy']
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)
model = LinearRegression()
model.fit(x_train, y_train)
LinearRegression()
print("Model Coefficient: ",model.coef_)
Model Coefficient: [ 1.90990608e-03 -1.78622357e-01 -1.31506496e+00 -1.48602770e-02 9.20675290e-02 -1.74915620e-01 3.20181751e-04 -1.43517253e-03 3.08617227e-07 5.79629880e-02 -6.79653079e-02 9.20289727e-03 6.30206120e-02 2.06165615e-02 -4.55328940e-01 2.66079354e-05 -1.43637620e-09 -7.62714790e-02 5.77338694e-03 9.15155662e+00 8.84643133e-01 5.69496329e-01 4.33145076e-03 -6.87035038e-02]
print("Model Intercept: ",model.intercept_)
Model Intercept: 415.2820949977114
pd.DataFrame(model.coef_, X.columns, columns = ['Coeff'])
| Coeff | |
|---|---|
| Country | 1.909906e-03 |
| Year | -1.786224e-01 |
| Status | -1.315065e+00 |
| Adult_Mortality | -1.486028e-02 |
| Infant_deaths | 9.206753e-02 |
| Alcohol | -1.749156e-01 |
| Percentage_expenditure | 3.201818e-04 |
| Hepatitis_B | -1.435173e-03 |
| Measles | 3.086172e-07 |
| BMI | 5.796299e-02 |
| under-five-deaths | -6.796531e-02 |
| Polio | 9.202897e-03 |
| Total_expenditure | 6.302061e-02 |
| Diphtheria | 2.061656e-02 |
| HIV/AIDS | -4.553289e-01 |
| GDP | 2.660794e-05 |
| Population | -1.436376e-09 |
| thinness_1-19_years | -7.627148e-02 |
| thinness_5-9_years | 5.773387e-03 |
| Income_composition_of_resources | 9.151557e+00 |
| Schooling | 8.846431e-01 |
| discretize | 5.694963e-01 |
| label | 4.331451e-03 |
| newpredict | -6.870350e-02 |
predictions = model.predict(x_test)
plt.scatter(y_test, predictions)
<matplotlib.collections.PathCollection at 0x241dea4fc40>
plt.hist(y_test - predictions)
(array([ 6., 8., 45., 145., 235., 118., 35., 4., 0., 1.]),
array([-15.47486166, -11.95338139, -8.43190111, -4.91042084,
-1.38894056, 2.13253972, 5.65401999, 9.17550027,
12.69698054, 16.21846082, 19.7399411 ]),
<BarContainer object of 10 artists>)
print("Mean absolute Error: ",metrics.mean_absolute_error(y_test, predictions))
Mean absolute Error: 3.007531505775279
print("Mean Squared Error: ",metrics.mean_squared_error(y_test, predictions))
Mean Squared Error: 15.682662983452122
print("Root mean squared Error: ",np.sqrt(metrics.mean_squared_error(y_test, predictions)))
Root mean squared Error: 3.96013421280796
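The three error measures printed above can be written out by hand. This is a minimal sketch using only the standard library. The sample values are made up for illustration, and the function names mirror the scikit-learn ones only for readability.

```python
import math

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_squared_error(y_true, y_pred):
    """Average squared difference between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def root_mean_squared_error(y_true, y_pred):
    """Square root of the MSE, expressed in the same units as the target."""
    return math.sqrt(mean_squared_error(y_true, y_pred))

# Made-up values, for illustration only
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = root_mean_squared_error(y_true, y_pred)
print(mae, mse, rmse)

# The RMSE reported above is the square root of the reported MSE
reported_rmse = math.sqrt(15.682662983452122)
print(reported_rmse)
```

Because RMSE is in the same units as the target, the reported value means the linear model is off by roughly four years on a typical test row.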
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=4)
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)
(1589, 24)
(1589,)
(398, 24)
(398,)
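sc = StandardScaler()
# Fit the scaler on the training features and scale them
x_train = sc.fit_transform(x_train)
# Scale the test features (fit_transform re-fits the scaler on the test set;
# sc.transform(x_test) would reuse the training statistics instead)
x_test = sc.fit_transform(x_test)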
svm_regress = svm.SVR()
#Fitting the X training and Y training set
svm_regress.fit(x_train, y_train)
SVR()
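Standardization is usually learned from the training split only and then applied unchanged to the test split. The sketch below shows that pattern on a single made-up column using the standard library. It assumes the population standard deviation, which is what `StandardScaler` uses. The helper names `fit_scaler` and `transform` are hypothetical.

```python
import statistics

def fit_scaler(column):
    """Learn the mean and population standard deviation from training values."""
    return statistics.fmean(column), statistics.pstdev(column)

def transform(column, mean, std):
    """Standardize values using statistics learned elsewhere."""
    return [(v - mean) / std for v in column]

# Made-up values, for illustration only
train_column = [1.0, 2.0, 3.0]
test_column = [4.0]

mean, std = fit_scaler(train_column)               # learned from the training split only
train_scaled = transform(train_column, mean, std)
test_scaled = transform(test_column, mean, std)    # reuses the training statistics

print(train_scaled)
print(test_scaled)
```

With this approach the test rows are placed on the same scale the model saw during training, so a test value far above the training range stays far above it after scaling.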
y_prediction = svm_regress.predict(x_train)
print(y_prediction.shape)
(1589,)
#Evaluating the model
R_squared = metrics.r2_score(y_train, y_prediction)
print("R_squared: ", R_squared)
Adjusted_R_squared = 1-(1-metrics.r2_score(y_train, y_prediction))*(len(y_train)-1)/(len(y_train)-x_train.shape[1]-1)
print("Adjusted_R_squared: ",Adjusted_R_squared)
Mean_Absolute_Error = metrics.mean_absolute_error(y_train, y_prediction)
print("Mean_Absolute_Error: ",Mean_Absolute_Error)
Mean_Squared_Error = metrics.mean_squared_error(y_train, y_prediction)
print("Mean_Squared_Error: ",Mean_Squared_Error)
Root_Mean_Squared_Error = np.sqrt(metrics.mean_squared_error(y_train, y_prediction))
print("Root_Mean_Squared_Error: ",Root_Mean_Squared_Error)
R_squared: 0.8588189184804081
Adjusted_R_squared: 0.8566524568714119
Mean_Absolute_Error: 2.439039817561648
Mean_Squared_Error: 13.401352384364326
Root_Mean_Squared_Error: 3.660785760511577
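The adjusted R-squared line in the cell above penalizes R-squared for the number of features. As a small sketch, the same formula can be wrapped in a function (the name `adjusted_r_squared` is hypothetical) and checked against the training figures reported above, which use 1589 rows and 24 features.

```python
def adjusted_r_squared(r_squared, n_samples, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r_squared) * (n_samples - 1) / (n_samples - n_features - 1)

# Training-set values reported above: R^2, number of rows, number of features
train_adjusted = adjusted_r_squared(0.8588189184804081, 1589, 24)
print(train_adjusted)
```

The penalty grows as the number of features approaches the number of rows, so the adjustment matters more on the smaller test split than on the training split.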
plt.figure(figsize=(8, 8))
plt.scatter(y_train, y_prediction, color='green')
plt.xlabel("Actual Life Expectancy")
plt.ylabel("Predicted Life Expectancy")
plt.title("Scatter Plot: Predicted Life Expectancy Vs Actual Life Expectancy")
plt.show()
Y_PredictionTest = svm_regress.predict(x_test)
# Evaluating the model
R_squared = metrics.r2_score(y_test, Y_PredictionTest)
print("R_squared: ",R_squared)
Adjusted_R_squared = 1-(1-metrics.r2_score(y_test, Y_PredictionTest))*(len(y_test)-1)/(len(y_test)-x_test.shape[1]-1)
print("Adjusted_R_squared: ",Adjusted_R_squared)
Mean_Absolute_Error = metrics.mean_absolute_error(y_test, Y_PredictionTest)
print("Mean_Absolute_Error: ",Mean_Absolute_Error)
Mean_Squared_Error = metrics.mean_squared_error(y_test, Y_PredictionTest)
print("Mean_Squared_Error: ",Mean_Squared_Error)
Root_Mean_Squared_Error = np.sqrt(metrics.mean_squared_error(y_test, Y_PredictionTest))
print("Root_Mean_Squared_Error: ",Root_Mean_Squared_Error)
R_squared: 0.8364568608433269
Adjusted_R_squared: 0.8259339778949083
Mean_Absolute_Error: 2.7033223569668343
Mean_Squared_Error: 14.654081564781793
Root_Mean_Squared_Error: 3.8280649896235817
svm_regress.score(x_test,y_test)
0.8364568608433269
##Holdout Validation
result = svm_regress.score(x_test, y_test)
print("R-squared using Hold-out Validation: %.2f%%" % (result*100.0))
R-squared using Hold-out Validation: 83.65%
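For a scikit-learn regressor, `score` returns the coefficient of determination (R-squared) rather than a classification accuracy. A minimal sketch of that quantity, using made-up values and a hypothetical helper name, looks like this:

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up values, for illustration only
y_true = [60.0, 70.0, 80.0]

perfect = r_squared(y_true, y_true)               # predictions equal the actual values
baseline = r_squared(y_true, [70.0, 70.0, 70.0])  # always predicting the mean

print(perfect, baseline)
```

A score of 1 means perfect predictions and a score of 0 means the model does no better than always predicting the mean, so the hold-out value is best read as the share of variance explained.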
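cutoff = 0.7 # decide on a cutoff limit
# Life expectancy is measured in years, so every value exceeds 0.7:
# both class arrays below are all ones and the confusion matrix has a single cell
y_pred_classes = np.zeros_like(Y_PredictionTest) # initialise an array full of zeros
y_pred_classes[Y_PredictionTest > cutoff] = 1 # set to 1 where the cutoff is exceeded
y_test_classes = np.zeros_like(Y_PredictionTest)
y_test_classes[y_test > cutoff] = 1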
confusion_matrix(y_test_classes, y_pred_classes)
array([[398]], dtype=int64)
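Turning regression output into classes only gives an informative confusion matrix when the cutoff lies inside the range of the target. The sketch below assumes a hypothetical cutoff of 70 years and made-up actual and predicted values. The helper names are also hypothetical.

```python
def binarize(values, cutoff):
    """Return 1 where the value exceeds the cutoff, otherwise 0."""
    return [1 if v > cutoff else 0 for v in values]

def confusion_counts(actual, predicted):
    """Return [[TN, FP], [FN, TP]] for 0/1 labels (rows: actual, columns: predicted)."""
    matrix = [[0, 0], [0, 0]]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix

# Made-up life expectancies in years, for illustration only
actual_years = [59.2, 72.4, 81.0, 64.9, 70.3]
predicted_years = [61.0, 69.5, 79.8, 66.2, 71.1]

cutoff_years = 70.0  # hypothetical threshold, chosen inside the data range
matrix = confusion_counts(binarize(actual_years, cutoff_years),
                          binarize(predicted_years, cutoff_years))
print(matrix)

# With a cutoff of 0.7 every value is labelled 1, so only one cell is populated
degenerate = confusion_counts(binarize(actual_years, 0.7),
                              binarize(predicted_years, 0.7))
print(degenerate)
```

Unlike this helper, which always returns a 2 by 2 table, scikit-learn's `confusion_matrix` returns a 1 by 1 array when only one label is present, which is what the output above shows.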
#define cross-validation method to use
cv = KFold(n_splits=10, random_state=1, shuffle=True)
#build multiple linear regression model
model = LinearRegression()
#use k-fold CV to evaluate model
scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error',cv=cv, n_jobs=-1)
#view mean absolute error
print("MAE using cross-validation: ",np.mean(np.absolute(scores)))
scores1 = cross_val_score(model, X, y, scoring='neg_mean_squared_error',cv=cv, n_jobs=-1)
#view RMSE
print("RMSE using cross-validation: ",np.sqrt(np.mean(np.absolute(scores1))))
MAE using cross-validation: 2.9791160893707134
RMSE using cross-validation: 3.9597743687204146
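K-fold cross-validation splits the rows into k folds and uses each fold once as the test set. The `neg_*` scorers return negated errors, which is why the cell above takes the absolute value. The sketch below is a standard-library illustration of the splitting step only, with a hypothetical helper name. Its shuffle will not reproduce scikit-learn's permutation for the same seed.

```python
import random

def k_fold_indices(n_samples, n_splits, seed):
    """Return a list of (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # The first n_samples % n_splits folds receive one extra row
    base, extra = divmod(n_samples, n_splits)
    fold_sizes = [base + (1 if i < extra else 0) for i in range(n_splits)]
    folds = []
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        folds.append((train_idx, test_idx))
        start += size
    return folds

# 1987 rows and 10 folds, as in the notebook
folds = k_fold_indices(1987, 10, seed=1)
fold_sizes = [len(test_idx) for _, test_idx in folds]
print(fold_sizes)
```

Every row lands in exactly one test fold, so the averaged error uses each observation once for evaluation.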
Now suppose we change the target variable from Life Expectancy to Status and redefine X and y accordingly. We fit both a Logistic Regression and a Linear Regression model, then compute the accuracy, confusion matrix, and classification report for each.
X = data[['Country','Year','Adult_Mortality', 'Life_Expectancy',
'Infant_deaths', 'Alcohol', 'Percentage_expenditure', 'Hepatitis_B',
'Measles', 'BMI', 'under-five-deaths', 'Polio', 'Total_expenditure',
'Diphtheria', 'HIV/AIDS', 'GDP', 'Population', 'thinness_1-19_years',
'thinness_5-9_years', 'Income_composition_of_resources', 'Schooling']]
y = data['Status']
classifier = LogisticRegression()
classifier2 = LinearRegression()
classifier.fit(X, y)
classifier2.fit(X, y)
LinearRegression()
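# Note: both models were already fit on the full X and y above, so these test rows
# were seen during training and the scores below are optimistic
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)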
predictions = classifier.predict(x_test)
predictions2 = classifier2.predict(x_test)
print("Analysis for Logistic Regression")
print(accuracy_score(y_test, predictions))
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))
Analysis for Logistic Regression
0.8894472361809045
[[ 33 60]
[ 6 498]]
precision recall f1-score support
0 0.85 0.35 0.50 93
1 0.89 0.99 0.94 504
accuracy 0.89 597
macro avg 0.87 0.67 0.72 597
weighted avg 0.89 0.89 0.87 597
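The figures in the classification report follow directly from the confusion matrix. As a worked check, the sketch below recomputes precision, recall, F1, and accuracy from the Logistic Regression matrix printed above, assuming the scikit-learn layout of actual classes in rows and predicted classes in columns. The helper name `per_class_metrics` is hypothetical.

```python
# Confusion matrix printed above (rows: actual 0, 1; columns: predicted 0, 1)
confusion = [[33, 60],
             [6, 498]]

def per_class_metrics(matrix, cls):
    """Return (precision, recall, f1) for one class of a square confusion matrix."""
    true_positive = matrix[cls][cls]
    predicted_total = sum(row[cls] for row in matrix)  # column sum
    actual_total = sum(matrix[cls])                    # row sum
    precision = true_positive / predicted_total
    recall = true_positive / actual_total
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

total = sum(sum(row) for row in confusion)
accuracy = (confusion[0][0] + confusion[1][1]) / total

p0, r0, f0 = per_class_metrics(confusion, 0)
p1, r1, f1_1 = per_class_metrics(confusion, 1)

print("accuracy:", accuracy)
print("class 0:", round(p0, 2), round(r0, 2), round(f0, 2))
print("class 1:", round(p1, 2), round(r1, 2), round(f1_1, 2))
```

The low recall for class 0 shows that most rows of the minority class are predicted as class 1, which the overall accuracy alone hides because class 1 makes up 504 of the 597 test rows.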
print("Analysis for Linear Regression")
print(accuracy_score(y_test, predictions2.round()))
print(confusion_matrix(y_test, predictions2.round()))
print(classification_report(y_test, predictions2.round()))
Analysis for Linear Regression
0.9279731993299832
[[ 57 36]
[ 7 497]]
precision recall f1-score support
0 0.89 0.61 0.73 93
1 0.93 0.99 0.96 504
accuracy 0.93 597
macro avg 0.91 0.80 0.84 597
weighted avg 0.93 0.93 0.92 597
X = data[['Country','Year', 'Status', 'Adult_Mortality',
'Infant_deaths', 'Alcohol', 'Percentage_expenditure', 'Hepatitis_B',
'Measles', 'BMI', 'under-five-deaths', 'Polio', 'Total_expenditure',
'Diphtheria', 'HIV/AIDS', 'GDP', 'Population', 'thinness_1-19_years',
'thinness_5-9_years', 'Income_composition_of_resources', 'Schooling',
'discretize', 'label', 'newpredict']]
y = data['Life_Expectancy']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=66)
regression=RandomForestRegressor()
regression.fit(X_train, y_train)
#prediction
y_predect=regression.predict(X_train)
# Evaluating the model
R_squared = metrics.r2_score(y_train, y_predect)
print("R_squared: ",R_squared)
Adjusted_R_squared = 1-(1-metrics.r2_score(y_train, y_predect))*(len(y_train)-1)/(len(y_train)-X_train.shape[1]-1)
print("Adjusted_R_squared: ",Adjusted_R_squared)
Mean_Absolute_Error = metrics.mean_absolute_error(y_train, y_predect)
print("Mean_Absolute_Error: ",Mean_Absolute_Error)
Mean_Squared_Error = metrics.mean_squared_error(y_train, y_predect)
print("Mean_Squared_Error: ",Mean_Squared_Error)
Root_Mean_Squared_Error = np.sqrt(metrics.mean_squared_error(y_train, y_predect))
print("Root_Mean_Squared_Error: ",Root_Mean_Squared_Error)
R_squared: 0.994173322601528
Adjusted_R_squared: 0.9940662473660278
Mean_Absolute_Error: 0.4483959429000752
Mean_Squared_Error: 0.5431332103681425
Root_Mean_Squared_Error: 0.7369757189813939
plt.figure(figsize=(8, 8))
plt.scatter(y_train, y_predect, color="purple")
plt.xlabel("Life Expectancy")
plt.ylabel("Predicted Life Expectancy")
plt.title("visualize the difference between the actual and predicted Life Expectancy")
plt.show()
# Predicting the Test data with model
y_test_predect=regression.predict(X_test)
# Evaluating the model
R_squared = metrics.r2_score(y_test, y_test_predect)
print("R_squared: ",R_squared)
Adjusted_R_squared = 1-(1-metrics.r2_score(y_test, y_test_predect))*(len(y_test)-1)/(len(y_test)-X_test.shape[1]-1)
print("Adjusted_R_squared: ",Adjusted_R_squared)
Mean_Absolute_Error = metrics.mean_absolute_error(y_test, y_test_predect)
print("Mean_Absolute_Error: ",Mean_Absolute_Error)
Mean_Squared_Error = metrics.mean_squared_error(y_test, y_test_predect)
print("Mean_Squared_Error: ",Mean_Squared_Error)
Root_Mean_Squared_Error = np.sqrt(metrics.mean_squared_error(y_test, y_test_predect))
print("Root_Mean_Squared_Error: ",Root_Mean_Squared_Error)
R_squared: 0.9601944913150153
Adjusted_R_squared: 0.9586804941542553
Mean_Absolute_Error: 1.2820884146341487
Mean_Squared_Error: 3.7783941615853665
Root_Mean_Squared_Error: 1.9438091885741682
regression.score(X_test,y_test)
0.9601944913150153
##Holdout Validation
result = regression.score(X_test, y_test)
print("R-squared using Hold-out Validation: %.2f%%" % (result*100.0))
R-squared using Hold-out Validation: 96.02%
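cutoff = 0.7 # decide on a cutoff limit
# Life expectancy is measured in years, so every value exceeds 0.7:
# both class arrays below are all ones and the confusion matrix has a single cell
y_pred_classes = np.zeros_like(y_test_predect) # initialise an array full of zeros
y_pred_classes[y_test_predect > cutoff] = 1 # set to 1 where the cutoff is exceeded
y_test_classes = np.zeros_like(y_test_predect)
y_test_classes[y_test > cutoff] = 1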
confusion_matrix(y_test_classes, y_pred_classes)
array([[656]], dtype=int64)
data.head()
| Country | Year | Status | Life_Expectancy | Adult_Mortality | Infant_deaths | Alcohol | Percentage_expenditure | Hepatitis_B | Measles | ... | GDP | Population | thinness_1-19_years | thinness_5-9_years | Income_composition_of_resources | Schooling | discretize | label | newpredict | Residual | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2015 | 1 | 65.0 | 263.0 | 62 | 0.01 | 71.279624 | 65.0 | 1154 | ... | 584.259210 | 33736494.0 | 17.2 | 17.3 | 0.479 | 10.1 | 3 | 1 | 71.070164 | -6.070164 |
| 1 | 0 | 2014 | 1 | 59.9 | 271.0 | 64 | 0.01 | 73.523582 | 62.0 | 492 | ... | 612.696514 | 327582.0 | 17.5 | 17.5 | 0.476 | 10.0 | 3 | 1 | 70.685534 | -10.785534 |
| 2 | 0 | 2013 | 1 | 59.9 | 268.0 | 66 | 0.01 | 73.219243 | 64.0 | 430 | ... | 631.744976 | 31731688.0 | 17.7 | 17.7 | 0.470 | 9.9 | 3 | 1 | 70.300904 | -10.400904 |
| 3 | 0 | 2012 | 1 | 59.5 | 272.0 | 69 | 0.01 | 78.184215 | 67.0 | 2787 | ... | 669.959000 | 3696958.0 | 17.9 | 18.0 | 0.463 | 9.8 | 3 | 1 | 69.916274 | -10.416274 |
| 4 | 0 | 2011 | 1 | 59.2 | 275.0 | 71 | 0.01 | 7.097109 | 68.0 | 3013 | ... | 63.537231 | 2978599.0 | 18.2 | 18.2 | 0.454 | 9.5 | 2 | 3 | 69.531644 | -10.331644 |
5 rows × 26 columns
#define cross-validation method to use
cv = KFold(n_splits=10, random_state=1, shuffle=True)
#build multiple linear regression model
model = LinearRegression()
#use k-fold CV to evaluate model
scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error',cv=cv, n_jobs=-1)
#view mean absolute error
print("MAE using cross-validation: ",np.mean(np.absolute(scores)))
scores1 = cross_val_score(model, X, y, scoring='neg_mean_squared_error',cv=cv, n_jobs=-1)
#view RMSE
print("RMSE using cross-validation: ",np.sqrt(np.mean(np.absolute(scores1))))
MAE using cross-validation: 2.9791160893707134
RMSE using cross-validation: 3.9597743687204146
Our goal was to understand, observe, and analyze the relationship between each column and Life Expectancy for different countries over the years, and also by status (developing or developed).